Falcon Airlines, a US carrier, has seen its sales decline over the years even though demand across the airline industry is growing. The marketing department therefore surveyed 90917 individuals who had travelled with the airline to measure their satisfaction with the service provided, the facilities, and the technology, with the aim of delivering a better, safer, and more pleasant customer experience.
The company defined a set of parameters that it considers important for understanding what customers expect today, so that it can identify ways to improve and innovate.
The project has two data sources: the flight data, which describes the passengers and the performance of the flights they travelled on, and the survey data, which was collected after the service experience.
# nb_black automatically formats each notebook cell with Black (good coding practice)
%load_ext nb_black
# To suppress the warnings
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
plot_roc_curve,
)
%matplotlib inline
import seaborn as sns
# To impute missing values
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
# For preprocessing
from sklearn import preprocessing
# For Feature selection
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, Lasso, lars_path
# For model evaluation
import time
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# loading the flight data
df1 = pd.read_csv("Flight data.csv")
# loading the survey data
df2 = pd.read_csv("Surveydata.csv")
df1.shape
(90917, 9)
df2.shape
(90917, 16)
# viewing a random sample of the dataset
df1.sample(n=10, random_state=1)
| | ID | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins |
|---|---|---|---|---|---|---|---|---|---|
| 50679 | 200644 | Male | Loyal Customer | 57 | Business travel | Eco | 2174 | 30 | 36.000 |
| 33900 | 183865 | Female | disloyal Customer | 35 | Business travel | Business | 1739 | 1 | 7.000 |
| 5924 | 155889 | Male | Loyal Customer | 56 | Personal Travel | Eco | 1166 | 0 | 0.000 |
| 47760 | 197725 | Male | Loyal Customer | 58 | Business travel | Business | 2794 | 36 | 60.000 |
| 30312 | 180277 | Female | disloyal Customer | 21 | Business travel | Business | 2170 | 0 | 0.000 |
| 35857 | 185822 | Female | disloyal Customer | 25 | Business travel | Eco | 1868 | 21 | 12.000 |
| 9051 | 159016 | Male | Loyal Customer | 10 | Personal Travel | Eco | 6787 | 4 | 26.000 |
| 56968 | 206933 | Female | Loyal Customer | 56 | Business travel | Eco | 446 | 0 | 0.000 |
| 8993 | 158958 | Female | Loyal Customer | 37 | Personal Travel | Eco | 1960 | 18 | 0.000 |
| 71828 | 221793 | Male | Loyal Customer | 24 | Business travel | Business | 3761 | 11 | 0.000 |
# viewing a random sample of the dataset
df2.sample(n=10, random_state=1)
| | Id | Satisfaction | Seat_comfort | Departure.Arrival.time_convenient | Food_drink | Gate_location | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50679 | 200644 | satisfied | need improvement | good | good | Convinient | need improvement | need improvement | need improvement | need improvement | acceptable | excellent | good | acceptable | poor | need improvement |
| 33900 | 183865 | neutral or dissatisfied | need improvement | need improvement | NaN | Convinient | poor | need improvement | poor | poor | acceptable | need improvement | good | acceptable | excellent | poor |
| 5924 | 155889 | neutral or dissatisfied | need improvement | poor | extremely poor | Convinient | acceptable | extremely poor | poor | acceptable | need improvement | excellent | good | need improvement | acceptable | acceptable |
| 47760 | 197725 | neutral or dissatisfied | poor | need improvement | need improvement | need improvement | need improvement | good | good | poor | poor | poor | poor | acceptable | poor | poor |
| 30312 | 180277 | neutral or dissatisfied | poor | poor | need improvement | Convinient | need improvement | need improvement | need improvement | need improvement | acceptable | good | acceptable | acceptable | acceptable | need improvement |
| 35857 | 185822 | neutral or dissatisfied | acceptable | poor | acceptable | manageable | excellent | acceptable | excellent | excellent | NaN | excellent | good | acceptable | acceptable | excellent |
| 9051 | 159016 | neutral or dissatisfied | need improvement | good | need improvement | need improvement | need improvement | good | good | excellent | acceptable | good | good | good | good | good |
| 56968 | 206933 | satisfied | acceptable | acceptable | acceptable | manageable | excellent | excellent | poor | acceptable | acceptable | acceptable | acceptable | good | acceptable | poor |
| 8993 | 158958 | satisfied | good | excellent | poor | very convinient | good | poor | excellent | good | good | good | good | good | good | acceptable |
| 71828 | 221793 | satisfied | excellent | excellent | NaN | very convinient | good | good | good | good | good | good | excellent | acceptable | good | good |
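# aligning the key column name so that both tables can be joined on "ID"
df2.rename(columns={"Id": "ID"}, inplace=True)
# joining the flight and survey records that share the same ID
data = df1.merge(df2, on="ID", how="inner")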
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
| | ID | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | ... | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50679 | 200644 | Male | Loyal Customer | 57 | Business travel | Eco | 2174 | 30 | 36.000 | satisfied | ... | need improvement | need improvement | need improvement | need improvement | acceptable | excellent | good | acceptable | poor | need improvement |
| 33900 | 183865 | Female | disloyal Customer | 35 | Business travel | Business | 1739 | 1 | 7.000 | neutral or dissatisfied | ... | poor | need improvement | poor | poor | acceptable | need improvement | good | acceptable | excellent | poor |
| 5924 | 155889 | Male | Loyal Customer | 56 | Personal Travel | Eco | 1166 | 0 | 0.000 | neutral or dissatisfied | ... | acceptable | extremely poor | poor | acceptable | need improvement | excellent | good | need improvement | acceptable | acceptable |
| 47760 | 197725 | Male | Loyal Customer | 58 | Business travel | Business | 2794 | 36 | 60.000 | neutral or dissatisfied | ... | need improvement | good | good | poor | poor | poor | poor | acceptable | poor | poor |
| 30312 | 180277 | Female | disloyal Customer | 21 | Business travel | Business | 2170 | 0 | 0.000 | neutral or dissatisfied | ... | need improvement | need improvement | need improvement | need improvement | acceptable | good | acceptable | acceptable | acceptable | need improvement |
| 35857 | 185822 | Female | disloyal Customer | 25 | Business travel | Eco | 1868 | 21 | 12.000 | neutral or dissatisfied | ... | excellent | acceptable | excellent | excellent | NaN | excellent | good | acceptable | acceptable | excellent |
| 9051 | 159016 | Male | Loyal Customer | 10 | Personal Travel | Eco | 6787 | 4 | 26.000 | neutral or dissatisfied | ... | need improvement | good | good | excellent | acceptable | good | good | good | good | good |
| 56968 | 206933 | Female | Loyal Customer | 56 | Business travel | Eco | 446 | 0 | 0.000 | satisfied | ... | excellent | excellent | poor | acceptable | acceptable | acceptable | acceptable | good | acceptable | poor |
| 8993 | 158958 | Female | Loyal Customer | 37 | Personal Travel | Eco | 1960 | 18 | 0.000 | satisfied | ... | good | poor | excellent | good | good | good | good | good | good | acceptable |
| 71828 | 221793 | Male | Loyal Customer | 24 | Business travel | Business | 3761 | 11 | 0.000 | satisfied | ... | good | good | good | good | good | good | excellent | acceptable | good | good |
10 rows × 24 columns
data.shape
(90917, 24)
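An inner join silently drops any ID that appears in only one of the two files, so it can be worth confirming that the tables really match one-to-one. The following is a minimal sketch on toy data, not part of the original notebook: the frames `flights` and `survey` are hypothetical stand-ins for `df1` and `df2`, and the check relies on the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical toy stand-ins for the flight and survey tables
flights = pd.DataFrame({"ID": [1, 2, 3], "Class": ["Eco", "Business", "Eco"]})
survey = pd.DataFrame(
    {"ID": [1, 2, 3], "Satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied"]}
)

# validate="one_to_one" raises pandas.errors.MergeError if an ID is duplicated
# on either side; indicator=True adds a "_merge" column showing where each row came from.
merged = flights.merge(
    survey, on="ID", how="outer", validate="one_to_one", indicator=True
)

# Rows present in only one of the two tables
unmatched = merged[merged["_merge"] != "both"]

print(len(merged), "rows after merge;", len(unmatched), "unmatched")
```

On the real data, an empty `unmatched` frame would be consistent with the shapes shown above, where both inputs and the merged table have 90917 rows.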
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# fixing column names
df.columns = [c.replace(" ", "_") for c in df.columns]
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90917 entries, 0 to 90916
Data columns (total 24 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   ID                                 90917 non-null  int64
 1   Gender                             90917 non-null  object
 2   CustomerType                       81818 non-null  object
 3   Age                                90917 non-null  int64
 4   TypeTravel                         81829 non-null  object
 5   Class                              90917 non-null  object
 6   Flight_Distance                    90917 non-null  int64
 7   DepartureDelayin_Mins              90917 non-null  int64
 8   ArrivalDelayin_Mins                90633 non-null  float64
 9   Satisfaction                       90917 non-null  object
 10  Seat_comfort                       90917 non-null  object
 11  Departure.Arrival.time_convenient  82673 non-null  object
 12  Food_drink                         82736 non-null  object
 13  Gate_location                      90917 non-null  object
 14  Inflightwifi_service               90917 non-null  object
 15  Inflight_entertainment             90917 non-null  object
 16  Online_support                     90917 non-null  object
 17  Ease_of_Onlinebooking              90917 non-null  object
 18  Onboard_service                    83738 non-null  object
 19  Leg_room_service                   90917 non-null  object
 20  Baggage_handling                   90917 non-null  object
 21  Checkin_service                    90917 non-null  object
 22  Cleanliness                        90917 non-null  object
 23  Online_boarding                    90917 non-null  object
dtypes: float64(1), int64(4), object(19)
memory usage: 17.3+ MB
# let's check for duplicate values in the data
df.duplicated().sum()
0
# let's check for missing values in the data
pd.DataFrame(
data={"% of Missing Values": round(df.isna().sum() / df.isna().count() * 100, 2)}
)
| | % of Missing Values |
|---|---|
| ID | 0.000 |
| Gender | 0.000 |
| CustomerType | 10.010 |
| Age | 0.000 |
| TypeTravel | 10.000 |
| Class | 0.000 |
| Flight_Distance | 0.000 |
| DepartureDelayin_Mins | 0.000 |
| ArrivalDelayin_Mins | 0.310 |
| Satisfaction | 0.000 |
| Seat_comfort | 0.000 |
| Departure.Arrival.time_convenient | 9.070 |
| Food_drink | 9.000 |
| Gate_location | 0.000 |
| Inflightwifi_service | 0.000 |
| Inflight_entertainment | 0.000 |
| Online_support | 0.000 |
| Ease_of_Onlinebooking | 0.000 |
| Onboard_service | 7.900 |
| Leg_room_service | 0.000 |
| Baggage_handling | 0.000 |
| Checkin_service | 0.000 |
| Cleanliness | 0.000 |
| Online_boarding | 0.000 |
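The same percentages can be obtained more compactly, because the mean of a boolean mask is the share of `True` values. The sketch below uses a small hypothetical frame `toy` (its values are illustrative only) and keeps just the columns that actually have gaps, sorted from most to least affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame(
    {
        "CustomerType": ["Loyal Customer", np.nan, "disloyal Customer", "Loyal Customer"],
        "Age": [57, 35, 56, 58],
        "Food_drink": ["good", np.nan, np.nan, "poor"],
    }
)

# isna() gives a boolean mask; its column-wise mean is the fraction of missing values
missing_pct = toy.isna().mean().mul(100).round(2)

# Keep only the columns with at least one missing value, largest share first
missing_only = missing_pct[missing_pct > 0].sort_values(ascending=False)

print(missing_only)
```

Applied to `df`, this would list only the six columns that show a non-zero share in the table above.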
# Checking missing values per row
num_missing = df.isnull().sum(axis=1)
num_missing.value_counts()
0    53634
1    32503
2     4768
3       12
dtype: int64
df[num_missing == 3].sample(n=10)
| | ID | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | ... | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31790 | 181755 | Female | NaN | 24 | Business travel | Eco | 1844 | 23 | NaN | neutral or dissatisfied | ... | need improvement | need improvement | acceptable | need improvement | NaN | poor | need improvement | acceptable | acceptable | need improvement |
| 72591 | 222556 | Female | NaN | 37 | Business travel | Business | 4019 | 0 | NaN | neutral or dissatisfied | ... | need improvement | good | good | good | NaN | good | good | poor | good | good |
| 87450 | 237415 | Female | Loyal Customer | 50 | NaN | Business | 1753 | 19 | NaN | satisfied | ... | excellent | good | excellent | excellent | NaN | excellent | excellent | good | excellent | excellent |
| 85743 | 235708 | Male | Loyal Customer | 50 | NaN | Business | 676 | 4 | NaN | satisfied | ... | need improvement | excellent | good | excellent | excellent | excellent | excellent | excellent | excellent | acceptable |
| 56851 | 206816 | Male | Loyal Customer | 58 | NaN | Eco | 2009 | 55 | NaN | neutral or dissatisfied | ... | acceptable | acceptable | acceptable | acceptable | acceptable | poor | need improvement | good | need improvement | acceptable |
| 1629 | 151594 | Male | NaN | 8 | Personal Travel | Eco | 3632 | 0 | NaN | neutral or dissatisfied | ... | poor | extremely poor | poor | poor | NaN | good | acceptable | poor | acceptable | poor |
| 50518 | 200483 | Male | NaN | 60 | Business travel | Eco Plus | 1333 | 13 | NaN | neutral or dissatisfied | ... | need improvement | need improvement | need improvement | need improvement | acceptable | excellent | need improvement | acceptable | acceptable | need improvement |
| 59655 | 209620 | Male | Loyal Customer | 42 | NaN | Eco | 1526 | 73 | NaN | neutral or dissatisfied | ... | acceptable | acceptable | acceptable | acceptable | poor | need improvement | acceptable | poor | acceptable | acceptable |
| 28843 | 178808 | Female | disloyal Customer | 24 | NaN | Eco Plus | 1798 | 0 | NaN | neutral or dissatisfied | ... | good | poor | good | good | NaN | poor | good | acceptable | good | good |
| 72277 | 222242 | Female | Loyal Customer | 25 | NaN | Business | 2736 | 21 | NaN | satisfied | ... | good | good | good | good | excellent | excellent | excellent | acceptable | excellent | good |
10 rows × 24 columns
df[num_missing == 2].sample(n=10)
| | ID | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | ... | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15052 | 165017 | Male | NaN | 11 | Personal Travel | Eco | 2818 | 100 | 92.000 | neutral or dissatisfied | ... | poor | acceptable | poor | poor | need improvement | need improvement | poor | good | need improvement | poor |
| 6660 | 156625 | Male | Loyal Customer | 53 | NaN | Eco | 2006 | 22 | 54.000 | neutral or dissatisfied | ... | poor | poor | poor | poor | NaN | acceptable | acceptable | poor | need improvement | poor |
| 35964 | 185929 | Male | disloyal Customer | 36 | NaN | Eco | 1955 | 22 | 19.000 | neutral or dissatisfied | ... | acceptable | acceptable | acceptable | acceptable | NaN | poor | acceptable | need improvement | acceptable | acceptable |
| 87496 | 237461 | Female | NaN | 53 | Business travel | Business | 1624 | 3 | 8.000 | satisfied | ... | good | good | good | excellent | NaN | excellent | excellent | acceptable | excellent | acceptable |
| 48902 | 198867 | Male | Loyal Customer | 56 | NaN | Business | 1951 | 1 | 0.000 | neutral or dissatisfied | ... | need improvement | good | good | poor | poor | need improvement | poor | need improvement | poor | acceptable |
| 21963 | 171928 | Male | NaN | 60 | Personal Travel | Eco | 1630 | 7 | 0.000 | neutral or dissatisfied | ... | poor | acceptable | acceptable | poor | excellent | need improvement | need improvement | good | acceptable | poor |
| 42612 | 192577 | Female | NaN | 23 | Business travel | Eco | 1887 | 20 | 15.000 | satisfied | ... | excellent | excellent | excellent | excellent | NaN | excellent | excellent | acceptable | good | excellent |
| 23474 | 173439 | Male | Loyal Customer | 33 | NaN | Eco | 4354 | 0 | 0.000 | satisfied | ... | good | need improvement | need improvement | acceptable | excellent | need improvement | need improvement | need improvement | acceptable | need improvement |
| 12444 | 162409 | Male | NaN | 12 | Personal Travel | Eco | 2585 | 6 | 0.000 | neutral or dissatisfied | ... | good | need improvement | good | good | NaN | acceptable | excellent | good | excellent | good |
| 43551 | 193516 | Female | disloyal Customer | 25 | NaN | Business | 1792 | 0 | 0.000 | satisfied | ... | poor | excellent | poor | poor | NaN | good | good | excellent | excellent | poor |
10 rows × 24 columns
df[num_missing == 1].sample(n=10)
| | ID | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | ... | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33861 | 183826 | Female | disloyal Customer | 36 | Business travel | Business | 2105 | 32 | 7.000 | neutral or dissatisfied | ... | need improvement | need improvement | need improvement | need improvement | good | good | good | acceptable | good | need improvement |
| 75342 | 225307 | Female | Loyal Customer | 60 | Business travel | Business | 921 | 5 | 14.000 | satisfied | ... | good | excellent | excellent | good | NaN | good | good | excellent | good | excellent |
| 46791 | 196756 | Female | NaN | 58 | Business travel | Business | 1977 | 0 | 0.000 | neutral or dissatisfied | ... | good | good | acceptable | poor | poor | poor | poor | need improvement | poor | need improvement |
| 82093 | 232058 | Male | Loyal Customer | 42 | Business travel | Business | 1920 | 36 | 22.000 | satisfied | ... | need improvement | acceptable | good | excellent | excellent | excellent | excellent | good | excellent | excellent |
| 50036 | 200001 | Male | Loyal Customer | 43 | Business travel | Eco | 1991 | 0 | 6.000 | neutral or dissatisfied | ... | need improvement | need improvement | need improvement | need improvement | NaN | good | poor | acceptable | need improvement | need improvement |
| 89190 | 239155 | Male | Loyal Customer | 52 | Business travel | Business | 353 | 10 | 13.000 | satisfied | ... | good | good | good | excellent | excellent | excellent | excellent | excellent | excellent | acceptable |
| 60690 | 210655 | Female | Loyal Customer | 50 | Business travel | Business | 1359 | 0 | 0.000 | neutral or dissatisfied | ... | excellent | acceptable | good | good | NaN | acceptable | good | acceptable | good | need improvement |
| 48582 | 198547 | Male | Loyal Customer | 18 | NaN | Eco | 2035 | 26 | 39.000 | neutral or dissatisfied | ... | poor | poor | acceptable | poor | acceptable | acceptable | good | need improvement | acceptable | poor |
| 76006 | 225971 | Male | Loyal Customer | 42 | Business travel | Business | 406 | 0 | 5.000 | satisfied | ... | acceptable | excellent | good | good | good | good | good | good | good | excellent |
| 90742 | 240707 | Female | NaN | 58 | Business travel | Business | 3857 | 0 | 0.000 | satisfied | ... | excellent | good | good | excellent | excellent | excellent | excellent | good | excellent | excellent |
10 rows × 24 columns
# checking the number of unique values in each column
df.nunique()
ID                                   90917
Gender                                   2
CustomerType                             2
Age                                     75
TypeTravel                               2
Class                                    3
Flight_Distance                       5213
DepartureDelayin_Mins                  436
ArrivalDelayin_Mins                    445
Satisfaction                             2
Seat_comfort                             6
Departure.Arrival.time_convenient        6
Food_drink                               6
Gate_location                            6
Inflightwifi_service                     6
Inflight_entertainment                   6
Online_support                           6
Ease_of_Onlinebooking                    6
Onboard_service                          6
Leg_room_service                         6
Baggage_handling                         5
Checkin_service                          6
Cleanliness                              6
Online_boarding                          6
dtype: int64
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 90917.000 | 195423.000 | 26245.622 | 149965.000 | 172694.000 | 195423.000 | 218152.000 | 240881.000 |
| Age | 90917.000 | 39.447 | 15.130 | 7.000 | 27.000 | 40.000 | 51.000 | 85.000 |
| Flight_Distance | 90917.000 | 1981.629 | 1026.780 | 50.000 | 1360.000 | 1927.000 | 2542.000 | 6950.000 |
| DepartureDelayin_Mins | 90917.000 | 14.687 | 38.669 | 0.000 | 0.000 | 0.000 | 12.000 | 1592.000 |
| ArrivalDelayin_Mins | 90633.000 | 15.059 | 39.039 | 0.000 | 0.000 | 0.000 | 13.000 | 1584.000 |
df.describe(include=["object"]).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Gender | 90917 | 2 | Female | 46186 |
| CustomerType | 81818 | 2 | Loyal Customer | 66897 |
| TypeTravel | 81829 | 2 | Business travel | 56481 |
| Class | 90917 | 3 | Business | 43535 |
| Satisfaction | 90917 | 2 | satisfied | 49761 |
| Seat_comfort | 90917 | 6 | acceptable | 20552 |
| Departure.Arrival.time_convenient | 82673 | 6 | good | 18840 |
| Food_drink | 82736 | 6 | acceptable | 17991 |
| Gate_location | 90917 | 6 | manageable | 23385 |
| Inflightwifi_service | 90917 | 6 | good | 22159 |
| Inflight_entertainment | 90917 | 6 | good | 29373 |
| Online_support | 90917 | 6 | good | 29042 |
| Ease_of_Onlinebooking | 90917 | 6 | good | 27993 |
| Onboard_service | 83738 | 6 | good | 26373 |
| Leg_room_service | 90917 | 6 | good | 27814 |
| Baggage_handling | 90917 | 5 | good | 33822 |
| Checkin_service | 90917 | 6 | good | 25483 |
| Cleanliness | 90917 | 6 | good | 34246 |
| Online_boarding | 90917 | 6 | good | 24676 |
for i in df.describe(include=["object"]).columns:
    print("Unique values in", i, "are :")
    print(df[i].value_counts())
    print("*" * 50)
Unique values in Gender are :
Female    46186
Male      44731
Name: Gender, dtype: int64
**************************************************
Unique values in CustomerType are :
Loyal Customer       66897
disloyal Customer    14921
Name: CustomerType, dtype: int64
**************************************************
Unique values in TypeTravel are :
Business travel    56481
Personal Travel    25348
Name: TypeTravel, dtype: int64
**************************************************
Unique values in Class are :
Business    43535
Eco         40758
Eco Plus     6624
Name: Class, dtype: int64
**************************************************
Unique values in Satisfaction are :
satisfied                  49761
neutral or dissatisfied    41156
Name: Satisfaction, dtype: int64
**************************************************
Unique values in Seat_comfort are :
acceptable          20552
need improvement    20002
good                19789
poor                14687
excellent           12519
extremely poor       3368
Name: Seat_comfort, dtype: int64
**************************************************
Unique values in Departure.Arrival.time_convenient are :
good                18840
excellent           17079
acceptable          14806
need improvement    14539
poor                13210
extremely poor       4199
Name: Departure.Arrival.time_convenient, dtype: int64
**************************************************
Unique values in Food_drink are :
acceptable          17991
need improvement    17359
good                17245
poor                13400
excellent           12947
extremely poor       3794
Name: Food_drink, dtype: int64
**************************************************
Unique values in Gate_location are :
manageable           23385
Convinient           21088
need improvement     17113
Inconvinient         15876
very convinient      13454
very inconvinient        1
Name: Gate_location, dtype: int64
**************************************************
Unique values in Inflightwifi_service are :
good                22159
excellent           20258
acceptable          19199
need improvement    18894
poor                10311
extremely poor         96
Name: Inflightwifi_service, dtype: int64
**************************************************
Unique values in Inflight_entertainment are :
good                29373
excellent           20786
acceptable          16995
need improvement    13527
poor                 8198
extremely poor       2038
Name: Inflight_entertainment, dtype: int64
**************************************************
Unique values in Online_support are :
good                29042
excellent           24916
acceptable          15090
need improvement    12063
poor                 9805
extremely poor          1
Name: Online_support, dtype: int64
**************************************************
Unique values in Ease_of_Onlinebooking are :
good                27993
excellent           23960
acceptable          15686
need improvement    13896
poor                 9370
extremely poor         12
Name: Ease_of_Onlinebooking, dtype: int64
**************************************************
Unique values in Onboard_service are :
good                26373
excellent           20396
acceptable          17411
need improvement    11018
poor                 8537
extremely poor          3
Name: Onboard_service, dtype: int64
**************************************************
Unique values in Leg_room_service are :
good                27814
excellent           24071
acceptable          15775
need improvement    15156
poor                 7779
extremely poor        322
Name: Leg_room_service, dtype: int64
**************************************************
Unique values in Baggage_handling are :
good                33822
excellent           25002
acceptable          17233
need improvement     9301
poor                 5559
Name: Baggage_handling, dtype: int64
**************************************************
Unique values in Checkin_service are :
good                25483
acceptable          24941
excellent           18918
need improvement    10813
poor                10761
extremely poor          1
Name: Checkin_service, dtype: int64
**************************************************
Unique values in Cleanliness are :
good                34246
excellent           25079
acceptable          16930
need improvement     9283
poor                 5375
extremely poor          4
Name: Cleanliness, dtype: int64
**************************************************
Unique values in Online_boarding are :
good                24676
acceptable          21427
excellent           20993
need improvement    13035
poor                10777
extremely poor          9
Name: Online_boarding, dtype: int64
**************************************************
The survey parameters and the binary target need to be encoded as numbers so that they are easier to work with.
## Encoding the survey data
# Binary target: 1 = satisfied, 0 = neutral or dissatisfied
satisfaction_map = {"neutral or dissatisfied": 0, "satisfied": 1}
# Shared 0-5 ordinal scale used by the service-rating columns
rating_map = {
    "extremely poor": 0,
    "poor": 1,
    "need improvement": 2,
    "acceptable": 3,
    "good": 4,
    "excellent": 5,
}
# Gate_location uses its own labels (spelling kept exactly as it appears in the data)
gate_map = {
    "very inconvinient": 0,
    "Inconvinient": 1,
    "need improvement": 2,
    "manageable": 3,
    "Convinient": 4,
    "very convinient": 5,
}
rating_cols = [
    "Seat_comfort",
    "Departure.Arrival.time_convenient",
    "Food_drink",
    "Inflightwifi_service",
    "Inflight_entertainment",
    "Online_support",
    "Ease_of_Onlinebooking",
    "Onboard_service",
    "Leg_room_service",
    "Checkin_service",
    "Cleanliness",
    "Online_boarding",
    "Baggage_handling",
]
# Assign the result back to the column instead of calling replace(..., inplace=True)
# on a column selection, which newer pandas versions may not propagate to df.
# map() keeps NaN as NaN; every label listed in the value counts above is covered.
df["Satisfaction"] = df["Satisfaction"].map(satisfaction_map)
df["Gate_location"] = df["Gate_location"].map(gate_map)
for col in rating_cols:
    df[col] = df[col].map(rating_map)
# viewing a random sample of the dataset
df.sample(n=10, random_state=1)
| | ID | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | ... | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50679 | 200644 | Male | Loyal Customer | 57 | Business travel | Eco | 2174 | 30 | 36.000 | 1 | ... | 2 | 2 | 2 | 2 | 3.000 | 5 | 4 | 3 | 1 | 2 |
| 33900 | 183865 | Female | disloyal Customer | 35 | Business travel | Business | 1739 | 1 | 7.000 | 0 | ... | 1 | 2 | 1 | 1 | 3.000 | 2 | 4 | 3 | 5 | 1 |
| 5924 | 155889 | Male | Loyal Customer | 56 | Personal Travel | Eco | 1166 | 0 | 0.000 | 0 | ... | 3 | 0 | 1 | 3 | 2.000 | 5 | 4 | 2 | 3 | 3 |
| 47760 | 197725 | Male | Loyal Customer | 58 | Business travel | Business | 2794 | 36 | 60.000 | 0 | ... | 2 | 4 | 4 | 1 | 1.000 | 1 | 1 | 3 | 1 | 1 |
| 30312 | 180277 | Female | disloyal Customer | 21 | Business travel | Business | 2170 | 0 | 0.000 | 0 | ... | 2 | 2 | 2 | 2 | 3.000 | 4 | 3 | 3 | 3 | 2 |
| 35857 | 185822 | Female | disloyal Customer | 25 | Business travel | Eco | 1868 | 21 | 12.000 | 0 | ... | 5 | 3 | 5 | 5 | NaN | 5 | 4 | 3 | 3 | 5 |
| 9051 | 159016 | Male | Loyal Customer | 10 | Personal Travel | Eco | 6787 | 4 | 26.000 | 0 | ... | 2 | 4 | 4 | 5 | 3.000 | 4 | 4 | 4 | 4 | 4 |
| 56968 | 206933 | Female | Loyal Customer | 56 | Business travel | Eco | 446 | 0 | 0.000 | 1 | ... | 5 | 5 | 1 | 3 | 3.000 | 3 | 3 | 4 | 3 | 1 |
| 8993 | 158958 | Female | Loyal Customer | 37 | Personal Travel | Eco | 1960 | 18 | 0.000 | 1 | ... | 4 | 1 | 5 | 4 | 4.000 | 4 | 4 | 4 | 4 | 3 |
| 71828 | 221793 | Male | Loyal Customer | 24 | Business travel | Business | 3761 | 11 | 0.000 | 1 | ... | 4 | 4 | 4 | 4 | 4.000 | 4 | 5 | 3 | 4 | 4 |
10 rows × 24 columns
# ID holds a unique identifier for each customer, so it adds no value to the modeling
df.drop(["ID"], axis=1, inplace=True)
# Making a list of all categorical variables
cat_col = [
    "Gender",
    "CustomerType",
    "TypeTravel",
    "Class",
]
# Converting the data type of each categorical variable to 'category'
for column in cat_col:
    df[column] = df[column].astype("category")
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90917 entries, 0 to 90916
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Gender                             90917 non-null  category
 1   CustomerType                       81818 non-null  category
 2   Age                                90917 non-null  int64
 3   TypeTravel                         81829 non-null  category
 4   Class                              90917 non-null  category
 5   Flight_Distance                    90917 non-null  int64
 6   DepartureDelayin_Mins              90917 non-null  int64
 7   ArrivalDelayin_Mins                90633 non-null  float64
 8   Satisfaction                       90917 non-null  int64
 9   Seat_comfort                       90917 non-null  int64
 10  Departure.Arrival.time_convenient  82673 non-null  float64
 11  Food_drink                         82736 non-null  float64
 12  Gate_location                      90917 non-null  int64
 13  Inflightwifi_service               90917 non-null  int64
 14  Inflight_entertainment             90917 non-null  int64
 15  Online_support                     90917 non-null  int64
 16  Ease_of_Onlinebooking              90917 non-null  int64
 17  Onboard_service                    83738 non-null  float64
 18  Leg_room_service                   90917 non-null  int64
 19  Baggage_handling                   90917 non-null  int64
 20  Checkin_service                    90917 non-null  int64
 21  Cleanliness                        90917 non-null  int64
 22  Online_boarding                    90917 non-null  int64
dtypes: category(4), float64(4), int64(15)
memory usage: 14.2 MB
The data columns are split into two groups: variables that the company monitors, and variables that the business controls.
Variables the company monitors in order to retain customers:
Customers Information
Variables controlled by the business, which have an impact on company profits:
Facilities Satisfaction
Online service Satisfaction
In Flight Satisfaction
The features in the three satisfaction groups above come from the survey entries, where passengers rate their flight experience on a scale of 0 to 5.
### Create a new column based on the facilities average
### (mean(axis=1) skips NaN by default, so a row with a missing rating is averaged over its remaining ratings)
df["Facilities_avg"] = df[
    ["Gate_location", "Onboard_service", "Baggage_handling", "Checkin_service"]
].mean(axis=1)
### Create a new column based on the online service average
df["Online_service_avg"] = df[
    ["Online_support", "Ease_of_Onlinebooking", "Online_boarding"]
].mean(axis=1)
### Create a new column based on the in-flight average
df["InFlight_avg"] = df[
    [
        "Seat_comfort",
        "Departure.Arrival.time_convenient",
        "Food_drink",
        "Inflightwifi_service",
        "Inflight_entertainment",
        "Leg_room_service",
        "Cleanliness",
    ]
].mean(axis=1)
# round the average results to the nearest whole rating
df["Facilities_avg"] = df["Facilities_avg"].round()
df["Online_service_avg"] = df["Online_service_avg"].round()
df["InFlight_avg"] = df["InFlight_avg"].round()
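Two details of this step are easy to miss. `DataFrame.mean(axis=1)` skips missing values by default, which is why the three `_avg` columns contain no nulls even though some inputs (for example `Onboard_service`) do. In addition, `Series.round()` sends exact halves to the nearest even integer rather than always rounding up. The minimal sketch below illustrates both effects on a made-up three-row frame; the `toy` name and its values are illustrative assumptions, not survey data.

```python
import numpy as np
import pandas as pd

# Illustrative ratings only; not taken from the survey data
toy = pd.DataFrame(
    {
        "Gate_location": [2, 3, 4],
        "Onboard_service": [3.0, np.nan, 5.0],
        "Baggage_handling": [4, 4, 5],
        "Checkin_service": [1, 3, 4],
    }
)

# skipna=True is the default, so the middle row is averaged over 3 values
row_mean = toy.mean(axis=1)
print(row_mean.tolist())

# Exact halves go to the nearest even integer: 2.5 -> 2.0 and 4.5 -> 4.0
rounded = row_mean.round()
print(rounded.tolist())
```

If a row with any missing rating should instead yield a missing average, passing `skipna=False` to `mean` would do that, at the cost of reintroducing nulls in the summary columns.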
# viewing a random sample of the summary dataset
df[
    [
        "Gender",
        "CustomerType",
        "Age",
        "TypeTravel",
        "Class",
        "Flight_Distance",
        "DepartureDelayin_Mins",
        "ArrivalDelayin_Mins",
        "Satisfaction",
        "Facilities_avg",
        "Online_service_avg",
        "InFlight_avg",
    ]
].sample(n=10, random_state=1)
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50679 | Male | Loyal Customer | 57 | Business travel | Eco | 2174 | 30 | 36.000 | 1 | 4.000 | 2.000 | 3.000 |
| 33900 | Female | disloyal Customer | 35 | Business travel | Business | 1739 | 1 | 7.000 | 0 | 4.000 | 1.000 | 2.000 |
| 5924 | Male | Loyal Customer | 56 | Personal Travel | Eco | 1166 | 0 | 0.000 | 0 | 3.000 | 2.000 | 2.000 |
| 47760 | Male | Loyal Customer | 58 | Business travel | Business | 2794 | 36 | 60.000 | 0 | 2.000 | 2.000 | 2.000 |
| 30312 | Female | disloyal Customer | 21 | Business travel | Business | 2170 | 0 | 0.000 | 0 | 3.000 | 2.000 | 2.000 |
| 35857 | Female | disloyal Customer | 25 | Business travel | Eco | 1868 | 21 | 12.000 | 0 | 3.000 | 5.000 | 3.000 |
| 9051 | Male | Loyal Customer | 10 | Personal Travel | Eco | 6787 | 4 | 26.000 | 0 | 3.000 | 4.000 | 3.000 |
| 56968 | Female | Loyal Customer | 56 | Business travel | Eco | 446 | 0 | 0.000 | 1 | 3.000 | 2.000 | 4.000 |
| 8993 | Female | Loyal Customer | 37 | Personal Travel | Eco | 1960 | 18 | 0.000 | 1 | 4.000 | 4.000 | 3.000 |
| 71828 | Male | Loyal Customer | 24 | Business travel | Business | 3761 | 11 | 0.000 | 1 | 4.000 | 4.000 | 4.000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90917 entries, 0 to 90916
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Gender                             90917 non-null  category
 1   CustomerType                       81818 non-null  category
 2   Age                                90917 non-null  int64
 3   TypeTravel                         81829 non-null  category
 4   Class                              90917 non-null  category
 5   Flight_Distance                    90917 non-null  int64
 6   DepartureDelayin_Mins              90917 non-null  int64
 7   ArrivalDelayin_Mins                90633 non-null  float64
 8   Satisfaction                       90917 non-null  int64
 9   Seat_comfort                       90917 non-null  int64
 10  Departure.Arrival.time_convenient  82673 non-null  float64
 11  Food_drink                         82736 non-null  float64
 12  Gate_location                      90917 non-null  int64
 13  Inflightwifi_service               90917 non-null  int64
 14  Inflight_entertainment             90917 non-null  int64
 15  Online_support                     90917 non-null  int64
 16  Ease_of_Onlinebooking              90917 non-null  int64
 17  Onboard_service                    83738 non-null  float64
 18  Leg_room_service                   90917 non-null  int64
 19  Baggage_handling                   90917 non-null  int64
 20  Checkin_service                    90917 non-null  int64
 21  Cleanliness                        90917 non-null  int64
 22  Online_boarding                    90917 non-null  int64
 23  Facilities_avg                     90917 non-null  float64
 24  Online_service_avg                 90917 non-null  float64
 25  InFlight_avg                       90917 non-null  float64
dtypes: category(4), float64(7), int64(15)
memory usage: 16.3 MB
df[
    [
        "Facilities_avg",
        "Gate_location",
        "Onboard_service",
        "Baggage_handling",
        "Checkin_service",
        "Satisfaction",
    ]
].describe()
| | Facilities_avg | Gate_location | Onboard_service | Baggage_handling | Checkin_service | Satisfaction |
|---|---|---|---|---|---|---|
| count | 90917.000 | 90917.000 | 83738.000 | 90917.000 | 90917.000 | 90917.000 |
| mean | 3.376 | 2.990 | 3.467 | 3.697 | 3.341 | 0.547 |
| std | 0.816 | 1.308 | 1.269 | 1.154 | 1.261 | 0.498 |
| min | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| 25% | 3.000 | 2.000 | 3.000 | 3.000 | 3.000 | 0.000 |
| 50% | 4.000 | 3.000 | 4.000 | 4.000 | 3.000 | 1.000 |
| 75% | 4.000 | 4.000 | 4.000 | 5.000 | 4.000 | 1.000 |
| max | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 1.000 |
df[
    [
        "Online_service_avg",
        "Online_support",
        "Ease_of_Onlinebooking",
        "Online_boarding",
        "Satisfaction",
    ]
].describe()
| | Online_service_avg | Online_support | Ease_of_Onlinebooking | Online_boarding | Satisfaction |
|---|---|---|---|---|---|
| count | 90917.000 | 90917.000 | 90917.000 | 90917.000 | 90917.000 |
| mean | 3.449 | 3.519 | 3.476 | 3.352 | 0.547 |
| std | 1.165 | 1.308 | 1.305 | 1.300 | 0.498 |
| min | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 3.000 | 3.000 | 2.000 | 2.000 | 0.000 |
| 50% | 4.000 | 4.000 | 4.000 | 4.000 | 1.000 |
| 75% | 4.000 | 5.000 | 5.000 | 4.000 | 1.000 |
| max | 5.000 | 5.000 | 5.000 | 5.000 | 1.000 |
df[
    [
        "InFlight_avg",
        "Seat_comfort",
        "Departure.Arrival.time_convenient",
        "Food_drink",
        "Inflightwifi_service",
        "Inflight_entertainment",
        "Leg_room_service",
        "Cleanliness",
        "Satisfaction",
    ]
].describe()
| | InFlight_avg | Seat_comfort | Departure.Arrival.time_convenient | Food_drink | Inflightwifi_service | Inflight_entertainment | Leg_room_service | Cleanliness | Satisfaction |
|---|---|---|---|---|---|---|---|---|---|
| count | 90917.000 | 90917.000 | 82673.000 | 82736.000 | 90917.000 | 90917.000 | 90917.000 | 90917.000 | 90917.000 |
| mean | 3.227 | 2.839 | 2.993 | 2.850 | 3.252 | 3.384 | 3.487 | 3.708 | 0.547 |
| std | 0.816 | 1.394 | 1.525 | 1.443 | 1.320 | 1.342 | 1.292 | 1.148 | 0.498 |
| min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 3.000 | 2.000 | 2.000 | 2.000 | 2.000 | 2.000 | 2.000 | 3.000 | 0.000 |
| 50% | 3.000 | 3.000 | 3.000 | 3.000 | 3.000 | 4.000 | 4.000 | 4.000 | 1.000 |
| 75% | 4.000 | 4.000 | 4.000 | 4.000 | 4.000 | 4.000 | 5.000 | 5.000 | 1.000 |
| max | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 1.000 |
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    # Histogram: pass bins only when the caller supplied a value
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
histogram_boxplot(df, "Age")
df[df["Age"] < 15]
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Male | Loyal Customer | 10 | Personal Travel | Eco | 1812 | 0 | 0.000 | 1 | 0 | ... | 2 | 3.000 | 3 | 4 | 5 | 4 | 2 | 4.000 | 2.000 | 2.000 |
| 13 | Female | Loyal Customer | 13 | Personal Travel | Eco | 3693 | 5 | 0.000 | 1 | 0 | ... | 4 | 4.000 | 4 | 1 | 3 | 1 | 4 | 2.000 | 4.000 | 1.000 |
| 16 | Female | Loyal Customer | 9 | Personal Travel | Eco | 3305 | 0 | 0.000 | 1 | 0 | ... | 3 | 1.000 | 1 | 1 | 3 | 3 | 3 | 2.000 | 4.000 | 1.000 |
| 17 | Female | Loyal Customer | 10 | Personal Travel | Eco | 2090 | 0 | 0.000 | 1 | 0 | ... | 1 | 3.000 | 5 | 1 | 4 | 2 | 1 | 2.000 | 1.000 | 1.000 |
| 24 | Male | Loyal Customer | 9 | Personal Travel | Eco | 972 | 0 | 0.000 | 1 | 0 | ... | 4 | 4.000 | 3 | 3 | 1 | 3 | 4 | 2.000 | 4.000 | 2.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90881 | Female | disloyal Customer | 12 | Personal Travel | Eco | 2429 | 0 | 3.000 | 0 | 3 | ... | 1 | 2.000 | 4 | 1 | 3 | 4 | 1 | 2.000 | 1.000 | 3.000 |
| 90893 | Female | disloyal Customer | 7 | Personal Travel | Eco | 1616 | 0 | 15.000 | 0 | 3 | ... | 3 | NaN | 1 | 4 | 3 | 1 | 3 | 3.000 | 3.000 | 3.000 |
| 90900 | Female | disloyal Customer | 14 | Personal Travel | Business | 1966 | 0 | 0.000 | 0 | 3 | ... | 1 | 5.000 | 5 | 4 | 4 | 4 | 1 | 4.000 | 1.000 | 3.000 |
| 90901 | Female | disloyal Customer | 14 | Personal Travel | Eco | 1972 | 0 | 0.000 | 0 | 3 | ... | 5 | 3.000 | 4 | 5 | 4 | 4 | 5 | 4.000 | 5.000 | 4.000 |
| 90912 | Female | disloyal Customer | 11 | Personal Travel | Eco | 2752 | 5 | 0.000 | 1 | 5 | ... | 2 | 3.000 | 5 | 3 | 5 | 4 | 2 | 3.000 | 2.000 | 4.000 |
4488 rows × 26 columns
df.loc[(df.Age < 15), "Age"] = np.nan
# check if missing values for Age are in place
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90917 entries, 0 to 90916
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Gender                             90917 non-null  category
 1   CustomerType                       81818 non-null  category
 2   Age                                86429 non-null  float64
 3   TypeTravel                         81829 non-null  category
 4   Class                              90917 non-null  category
 5   Flight_Distance                    90917 non-null  int64
 6   DepartureDelayin_Mins              90917 non-null  int64
 7   ArrivalDelayin_Mins                90633 non-null  float64
 8   Satisfaction                       90917 non-null  int64
 9   Seat_comfort                       90917 non-null  int64
 10  Departure.Arrival.time_convenient  82673 non-null  float64
 11  Food_drink                         82736 non-null  float64
 12  Gate_location                      90917 non-null  int64
 13  Inflightwifi_service               90917 non-null  int64
 14  Inflight_entertainment             90917 non-null  int64
 15  Online_support                     90917 non-null  int64
 16  Ease_of_Onlinebooking              90917 non-null  int64
 17  Onboard_service                    83738 non-null  float64
 18  Leg_room_service                   90917 non-null  int64
 19  Baggage_handling                   90917 non-null  int64
 20  Checkin_service                    90917 non-null  int64
 21  Cleanliness                        90917 non-null  int64
 22  Online_boarding                    90917 non-null  int64
 23  Facilities_avg                     90917 non-null  float64
 24  Online_service_avg                 90917 non-null  float64
 25  InFlight_avg                       90917 non-null  float64
dtypes: category(4), float64(8), int64(14)
memory usage: 16.3 MB
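In the output above, `Age` has changed from `int64` to `float64`. NaN is a floating-point value, so writing it into an integer column forces an upcast. The small sketch below reproduces the effect with `Series.mask`, an assignment-based alternative to the `.loc` write; the `toy` frame and its ages are made up for illustration, and only the under-15 threshold comes from the step above.

```python
import pandas as pd

# Made-up ages; two of them fall under the threshold of 15
toy = pd.DataFrame({"Age": [10, 14, 15, 42]})
print(toy["Age"].dtype)

# mask() replaces values where the condition holds with NaN by default
toy["Age"] = toy["Age"].mask(toy["Age"] < 15)
print(toy["Age"].dtype)
print(toy["Age"].isna().sum())
```

An age of exactly 15 is kept, because the comparison is strict.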
histogram_boxplot(df, "Flight_Distance")
df[df["Flight_Distance"] > 4400]
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | Female | Loyal Customer | 21.000 | Personal Travel | Eco | 4804 | 20 | NaN | 1 | 0 | ... | 5 | NaN | 1 | 4 | 2 | 3 | 5 | 3.000 | 5.000 | 2.000 |
| 400 | Female | Loyal Customer | 63.000 | Personal Travel | Business | 6591 | 3 | 0.000 | 1 | 0 | ... | 3 | 2.000 | 4 | 3 | 1 | 5 | 1 | 2.000 | 2.000 | 3.000 |
| 801 | Male | Loyal Customer | 35.000 | Personal Travel | Eco | 6470 | 0 | 32.000 | 0 | 1 | ... | 1 | 2.000 | 2 | 2 | 3 | 2 | 3 | 2.000 | 2.000 | 2.000 |
| 802 | Male | Loyal Customer | 23.000 | Personal Travel | Eco | 4650 | 0 | NaN | 0 | 1 | ... | 2 | 1.000 | 3 | 4 | 4 | 4 | 2 | 3.000 | 2.000 | 2.000 |
| 1156 | Male | Loyal Customer | 21.000 | Personal Travel | Eco | 5127 | 2 | 2.000 | 0 | 1 | ... | 3 | 5.000 | 1 | 2 | 2 | 2 | 2 | 3.000 | 2.000 | 1.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90432 | Female | NaN | 31.000 | Business travel | Business | 4889 | 19 | 14.000 | 1 | 1 | ... | 5 | 5.000 | 3 | 5 | 4 | 4 | 5 | 4.000 | 5.000 | 3.000 |
| 90548 | Male | Loyal Customer | 34.000 | Business travel | Eco | 5832 | 4 | 1.000 | 1 | 5 | ... | 4 | 5.000 | 2 | 5 | 5 | 5 | 5 | 5.000 | 5.000 | 4.000 |
| 90672 | Female | Loyal Customer | 22.000 | NaN | Business | 4652 | 0 | 0.000 | 1 | 3 | ... | 5 | 4.000 | 3 | 4 | 3 | 5 | 5 | 4.000 | 5.000 | 4.000 |
| 90710 | Female | NaN | 42.000 | Business travel | Business | 5403 | 189 | 153.000 | 1 | 1 | ... | 5 | 5.000 | 5 | 4 | 5 | 4 | 5 | 4.000 | 5.000 | 3.000 |
| 90843 | Female | disloyal Customer | 15.000 | Personal Travel | Eco | 4522 | 0 | 0.000 | 0 | 1 | ... | 1 | 5.000 | 3 | 5 | 5 | 5 | 1 | 5.000 | 1.000 | 2.000 |
1623 rows × 26 columns
histogram_boxplot(df, "DepartureDelayin_Mins")
df[df["DepartureDelayin_Mins"] > 35]
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Female | Loyal Customer | 58.000 | Personal Travel | Eco | 104 | 47 | 48.000 | 1 | 0 | ... | 3 | 3.000 | 0 | 1 | 2 | 3 | 5 | 2.000 | 4.000 | 1.000 |
| 11 | Female | Loyal Customer | 47.000 | Personal Travel | Eco | 84 | 40 | 48.000 | 1 | 0 | ... | 5 | 5.000 | 0 | 5 | 2 | 5 | 2 | 3.000 | 3.000 | 2.000 |
| 36 | Female | Loyal Customer | 17.000 | Personal Travel | Eco | 2748 | 427 | 440.000 | 1 | 0 | ... | 3 | 1.000 | 3 | 4 | 4 | 1 | 4 | 3.000 | 4.000 | 1.000 |
| 110 | Female | Loyal Customer | 55.000 | Personal Travel | Eco | 994 | 68 | 80.000 | 1 | 0 | ... | 3 | 3.000 | 0 | 3 | 1 | 3 | 3 | 3.000 | 3.000 | 2.000 |
| 128 | Male | Loyal Customer | 24.000 | Personal Travel | Eco | 3219 | 93 | 86.000 | 1 | 0 | ... | 3 | 4.000 | 2 | 5 | 5 | 5 | 3 | 4.000 | 3.000 | 2.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90892 | Female | disloyal Customer | 54.000 | Personal Travel | Eco | 2928 | 41 | 41.000 | 0 | 3 | ... | 1 | 4.000 | 2 | 1 | 2 | 1 | 1 | 2.000 | 1.000 | 3.000 |
| 90910 | Female | NaN | 70.000 | Personal Travel | Eco | 1674 | 54 | 46.000 | 1 | 5 | ... | 5 | 3.000 | 2 | 4 | 5 | 4 | 5 | 3.000 | 5.000 | 4.000 |
| 90914 | Male | disloyal Customer | 69.000 | Personal Travel | Eco | 2320 | 155 | 163.000 | 0 | 3 | ... | 4 | 4.000 | 3 | 4 | 2 | 3 | 2 | 3.000 | 3.000 | 3.000 |
| 90915 | Male | disloyal Customer | 66.000 | Personal Travel | Eco | 2450 | 193 | 205.000 | 0 | 3 | ... | 3 | 3.000 | 2 | 3 | 2 | 1 | 2 | 2.000 | 2.000 | 2.000 |
| 90916 | Female | disloyal Customer | 38.000 | Personal Travel | Eco | 4307 | 185 | 186.000 | 0 | 3 | ... | 4 | 5.000 | 5 | 5 | 3 | 3 | 3 | 4.000 | 3.000 | 3.000 |
11053 rows × 26 columns
histogram_boxplot(df, "ArrivalDelayin_Mins")
df[df["ArrivalDelayin_Mins"] > 35]
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Female | Loyal Customer | 58.000 | Personal Travel | Eco | 104 | 47 | 48.000 | 1 | 0 | ... | 3 | 3.000 | 0 | 1 | 2 | 3 | 5 | 2.000 | 4.000 | 1.000 |
| 11 | Female | Loyal Customer | 47.000 | Personal Travel | Eco | 84 | 40 | 48.000 | 1 | 0 | ... | 5 | 5.000 | 0 | 5 | 2 | 5 | 2 | 3.000 | 3.000 | 2.000 |
| 36 | Female | Loyal Customer | 17.000 | Personal Travel | Eco | 2748 | 427 | 440.000 | 1 | 0 | ... | 3 | 1.000 | 3 | 4 | 4 | 1 | 4 | 3.000 | 4.000 | 1.000 |
| 61 | Female | NaN | 67.000 | Personal Travel | Eco | 465 | 35 | 69.000 | 1 | 0 | ... | 4 | 4.000 | 0 | 2 | 1 | 4 | 3 | 3.000 | 4.000 | 1.000 |
| 110 | Female | Loyal Customer | 55.000 | Personal Travel | Eco | 994 | 68 | 80.000 | 1 | 0 | ... | 3 | 3.000 | 0 | 3 | 1 | 3 | 3 | 3.000 | 3.000 | 2.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90894 | Female | disloyal Customer | 20.000 | Personal Travel | Business | 2946 | 8 | 37.000 | 0 | 3 | ... | 3 | 3.000 | 3 | 5 | 3 | 5 | 3 | 3.000 | 3.000 | 4.000 |
| 90910 | Female | NaN | 70.000 | Personal Travel | Eco | 1674 | 54 | 46.000 | 1 | 5 | ... | 5 | 3.000 | 2 | 4 | 5 | 4 | 5 | 3.000 | 5.000 | 4.000 |
| 90914 | Male | disloyal Customer | 69.000 | Personal Travel | Eco | 2320 | 155 | 163.000 | 0 | 3 | ... | 4 | 4.000 | 3 | 4 | 2 | 3 | 2 | 3.000 | 3.000 | 3.000 |
| 90915 | Male | disloyal Customer | 66.000 | Personal Travel | Eco | 2450 | 193 | 205.000 | 0 | 3 | ... | 3 | 3.000 | 2 | 3 | 2 | 1 | 2 | 2.000 | 2.000 | 2.000 |
| 90916 | Female | disloyal Customer | 38.000 | Personal Travel | Eco | 4307 | 185 | 186.000 | 0 | 3 | ... | 4 | 5.000 | 5 | 5 | 3 | 3 | 3 | 4.000 | 3.000 | 3.000 |
11241 rows × 26 columns
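The cut-offs used in the filters above (4400 for `Flight_Distance`, 35 for the two delay columns) are given without a derivation. One common convention for flagging points beyond a boxplot's upper whisker is the 1.5 × IQR rule, sketched below on made-up delay values. Whether the thresholds above were chosen this way is an assumption, and `upper_whisker` is a hypothetical helper name.

```python
import pandas as pd


def upper_whisker(series, k=1.5):
    """Return Q3 + k * IQR; quantile() ignores missing values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)


# Made-up delays in minutes, with one extreme value
delays = pd.Series([0, 0, 0, 5, 12, 20, 150], name="delay_mins")

limit = upper_whisker(delays)
outliers = delays[delays > limit]
print(limit)
print(outliers.tolist())
```

On heavily right-skewed columns such as the delays, this rule tends to flag a large share of rows, so the resulting limit is better read as a screening aid than as a reason to drop data.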
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with the count or percentage at the top of each bar
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its label
    plt.show()  # show the plot
labeled_barplot(df, "Gender")
labeled_barplot(df, "CustomerType")
labeled_barplot(df, "TypeTravel")
labeled_barplot(df, "Class")
# Grouping customers by travel class
df_eco = df[df["Class"] == "Eco"]
df_ecoplus = df[df["Class"] == "Eco Plus"]
df_bus = df[df["Class"] == "Business"]
df_eco.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 40758 | 2 | Female | 20690 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 36590 | 2 | Loyal Customer | 28062 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 37415.000 | NaN | NaN | NaN | 39.667 | 15.401 | 15.000 | 26.000 | 38.000 | 52.000 | 85.000 |
| TypeTravel | 36674 | 2 | Personal Travel | 20738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 40758 | 1 | Eco | 40758 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 40758.000 | NaN | NaN | NaN | 1824.298 | 794.114 | 50.000 | 1379.000 | 1841.000 | 2277.000 | 6924.000 |
| DepartureDelayin_Mins | 40758.000 | NaN | NaN | NaN | 15.023 | 39.411 | 0.000 | 0.000 | 0.000 | 12.000 | 1592.000 |
| ArrivalDelayin_Mins | 40619.000 | NaN | NaN | NaN | 15.554 | 39.849 | 0.000 | 0.000 | 0.000 | 14.000 | 1584.000 |
| Satisfaction | 40758.000 | NaN | NaN | NaN | 0.393 | 0.488 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Seat_comfort | 40758.000 | NaN | NaN | NaN | 2.875 | 1.347 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure.Arrival.time_convenient | 37121.000 | NaN | NaN | NaN | 3.075 | 1.547 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 37140.000 | NaN | NaN | NaN | 2.776 | 1.399 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 40758.000 | NaN | NaN | NaN | 3.000 | 1.239 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 40758.000 | NaN | NaN | NaN | 3.166 | 1.353 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 40758.000 | NaN | NaN | NaN | 3.057 | 1.378 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_support | 40758.000 | NaN | NaN | NaN | 3.287 | 1.371 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 40758.000 | NaN | NaN | NaN | 3.306 | 1.352 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Onboard_service | 37578.000 | NaN | NaN | NaN | 3.273 | 1.308 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Leg_room_service | 40758.000 | NaN | NaN | NaN | 3.328 | 1.335 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Baggage_handling | 40758.000 | NaN | NaN | NaN | 3.569 | 1.171 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Checkin_service | 40758.000 | NaN | NaN | NaN | 3.185 | 1.299 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 40758.000 | NaN | NaN | NaN | 3.583 | 1.165 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Online_boarding | 40758.000 | NaN | NaN | NaN | 3.228 | 1.347 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 40758.000 | NaN | NaN | NaN | 3.262 | 0.814 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 40758.000 | NaN | NaN | NaN | 3.275 | 1.243 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| InFlight_avg | 40758.000 | NaN | NaN | NaN | 3.129 | 0.804 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
df_ecoplus.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 6624 | 2 | Female | 3509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 5941 | 2 | Loyal Customer | 5363 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 6130.000 | NaN | NaN | NaN | 41.027 | 15.068 | 15.000 | 29.000 | 39.000 | 53.000 | 85.000 |
| TypeTravel | 5982 | 2 | Business travel | 3104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 6624 | 1 | Eco Plus | 6624 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 6624.000 | NaN | NaN | NaN | 1790.261 | 832.067 | 50.000 | 1326.000 | 1804.000 | 2261.000 | 6889.000 |
| DepartureDelayin_Mins | 6624.000 | NaN | NaN | NaN | 15.174 | 36.029 | 0.000 | 0.000 | 0.000 | 13.000 | 469.000 |
| ArrivalDelayin_Mins | 6601.000 | NaN | NaN | NaN | 15.613 | 36.138 | 0.000 | 0.000 | 0.000 | 15.000 | 518.000 |
| Satisfaction | 6624.000 | NaN | NaN | NaN | 0.426 | 0.495 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Seat_comfort | 6624.000 | NaN | NaN | NaN | 2.931 | 1.384 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure.Arrival.time_convenient | 6039.000 | NaN | NaN | NaN | 3.106 | 1.487 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 6038.000 | NaN | NaN | NaN | 2.810 | 1.441 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 6624.000 | NaN | NaN | NaN | 2.982 | 1.296 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 6624.000 | NaN | NaN | NaN | 3.196 | 1.351 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 6624.000 | NaN | NaN | NaN | 3.097 | 1.382 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_support | 6624.000 | NaN | NaN | NaN | 3.298 | 1.349 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 6624.000 | NaN | NaN | NaN | 3.323 | 1.342 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Onboard_service | 6091.000 | NaN | NaN | NaN | 3.168 | 1.326 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Leg_room_service | 6624.000 | NaN | NaN | NaN | 3.284 | 1.355 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Baggage_handling | 6624.000 | NaN | NaN | NaN | 3.462 | 1.182 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Checkin_service | 6624.000 | NaN | NaN | NaN | 3.077 | 1.323 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 6624.000 | NaN | NaN | NaN | 3.492 | 1.160 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Online_boarding | 6624.000 | NaN | NaN | NaN | 3.220 | 1.337 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 6624.000 | NaN | NaN | NaN | 3.175 | 0.820 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 6624.000 | NaN | NaN | NaN | 3.283 | 1.222 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| InFlight_avg | 6624.000 | NaN | NaN | NaN | 3.137 | 0.810 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
df_bus.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 43535 | 2 | Female | 21987 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 39287 | 2 | Loyal Customer | 33472 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 42884.000 | NaN | NaN | NaN | 42.049 | 12.294 | 15.000 | 33.000 | 42.000 | 51.000 | 85.000 |
| TypeTravel | 39173 | 2 | Business travel | 37441 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 43535 | 1 | Business | 43535 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 43535.000 | NaN | NaN | NaN | 2158.043 | 1202.562 | 50.000 | 1326.000 | 2101.000 | 3001.000 | 6950.000 |
| DepartureDelayin_Mins | 43535.000 | NaN | NaN | NaN | 14.298 | 38.352 | 0.000 | 0.000 | 0.000 | 12.000 | 1305.000 |
| ArrivalDelayin_Mins | 43413.000 | NaN | NaN | NaN | 14.511 | 38.689 | 0.000 | 0.000 | 0.000 | 12.000 | 1280.000 |
| Satisfaction | 43535.000 | NaN | NaN | NaN | 0.711 | 0.453 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| Seat_comfort | 43535.000 | NaN | NaN | NaN | 2.791 | 1.435 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure.Arrival.time_convenient | 39513.000 | NaN | NaN | NaN | 2.899 | 1.504 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 39558.000 | NaN | NaN | NaN | 2.925 | 1.479 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 43535.000 | NaN | NaN | NaN | 2.983 | 1.371 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 43535.000 | NaN | NaN | NaN | 3.340 | 1.278 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 43535.000 | NaN | NaN | NaN | 3.734 | 1.207 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Online_support | 43535.000 | NaN | NaN | NaN | 3.770 | 1.190 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Ease_of_Onlinebooking | 43535.000 | NaN | NaN | NaN | 3.657 | 1.227 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Onboard_service | 40069.000 | NaN | NaN | NaN | 3.693 | 1.181 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Leg_room_service | 43535.000 | NaN | NaN | NaN | 3.667 | 1.215 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Baggage_handling | 43535.000 | NaN | NaN | NaN | 3.853 | 1.113 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin_service | 43535.000 | NaN | NaN | NaN | 3.526 | 1.185 | 0.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Cleanliness | 43535.000 | NaN | NaN | NaN | 3.857 | 1.111 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Online_boarding | 43535.000 | NaN | NaN | NaN | 3.489 | 1.233 | 0.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Facilities_avg | 43535.000 | NaN | NaN | NaN | 3.514 | 0.793 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Online_service_avg | 43535.000 | NaN | NaN | NaN | 3.638 | 1.046 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| InFlight_avg | 43535.000 | NaN | NaN | NaN | 3.332 | 0.814 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
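The three per-class summaries above can also be compared side by side with a single `groupby`. The sketch below uses a small made-up frame that borrows the column names from `df`; the values are illustrative only. On the real data the same call would return one row per travel class.

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names
toy = pd.DataFrame(
    {
        "Class": ["Eco", "Eco", "Eco Plus", "Business", "Business", "Business"],
        "Satisfaction": [0, 1, 0, 1, 1, 0],
        "InFlight_avg": [3.0, 2.0, 3.0, 4.0, 3.0, 5.0],
    }
)

# Mean of each selected column within each class
summary = toy.groupby("Class")[["Satisfaction", "InFlight_avg"]].mean()
print(summary)
```

Because `Satisfaction` is coded 0/1, its group mean reads directly as the share of satisfied customers in that class.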
labeled_barplot(df, "Satisfaction")
labeled_barplot(df, "Seat_comfort")
labeled_barplot(df, "Departure.Arrival.time_convenient")
labeled_barplot(df, "Food_drink")
labeled_barplot(df, "Gate_location")
labeled_barplot(df, "Inflightwifi_service")
labeled_barplot(df, "Inflight_entertainment")
labeled_barplot(df, "Online_support")
labeled_barplot(df, "Ease_of_Onlinebooking")
labeled_barplot(df, "Onboard_service")
labeled_barplot(df, "Leg_room_service")
labeled_barplot(df, "Baggage_handling")
labeled_barplot(df, "Checkin_service")
labeled_barplot(df, "Cleanliness")
labeled_barplot(df, "Online_boarding")
labeled_barplot(df, "Facilities_avg")
labeled_barplot(df, "Online_service_avg")
labeled_barplot(df, "InFlight_avg")
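# Correlation is defined only for numeric columns, so the categorical ones are excluded explicitly
corr_matrix = df.select_dtypes(include="number").corr()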
corr_matrix
| | Age | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | Departure.Arrival.time_convenient | Food_drink | Gate_location | Inflightwifi_service | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000 | -0.256 | -0.007 | -0.010 | 0.105 | 0.009 | 0.065 | 0.015 | -0.002 | 0.007 | ... | 0.070 | 0.070 | 0.092 | -0.009 | 0.032 | -0.011 | 0.031 | 0.030 | 0.080 | 0.073 |
| Flight_Distance | -0.256 | 1.000 | 0.111 | 0.108 | -0.040 | -0.046 | -0.001 | -0.009 | -0.003 | 0.012 | ... | -0.024 | -0.034 | -0.033 | 0.017 | 0.004 | 0.007 | 0.010 | -0.004 | -0.017 | -0.022 |
| DepartureDelayin_Mins | -0.007 | 0.111 | 1.000 | 0.966 | -0.073 | -0.022 | 0.005 | -0.013 | 0.006 | -0.033 | ... | -0.036 | -0.039 | 0.002 | -0.011 | -0.021 | -0.062 | -0.019 | -0.023 | -0.032 | -0.038 |
| ArrivalDelayin_Mins | -0.010 | 0.108 | 0.966 | 1.000 | -0.079 | -0.024 | 0.003 | -0.015 | 0.006 | -0.035 | ... | -0.039 | -0.041 | -0.000 | -0.015 | -0.023 | -0.067 | -0.021 | -0.027 | -0.035 | -0.042 |
| Satisfaction | 0.105 | -0.040 | -0.073 | -0.079 | 1.000 | 0.241 | -0.015 | 0.118 | -0.012 | 0.226 | ... | 0.430 | 0.351 | 0.305 | 0.259 | 0.267 | 0.259 | 0.335 | 0.327 | 0.430 | 0.392 |
| Seat_comfort | 0.009 | -0.046 | -0.022 | -0.024 | 0.241 | 1.000 | 0.435 | 0.719 | 0.408 | 0.130 | ... | 0.210 | 0.120 | 0.137 | 0.118 | 0.045 | 0.109 | 0.131 | 0.245 | 0.173 | 0.712 |
| Departure.Arrival.time_convenient | 0.065 | -0.001 | 0.005 | 0.003 | -0.015 | 0.435 | 1.000 | 0.527 | 0.546 | -0.005 | ... | 0.000 | 0.060 | 0.027 | 0.068 | 0.063 | 0.064 | 0.000 | 0.277 | 0.001 | 0.535 |
| Food_drink | 0.015 | -0.009 | -0.013 | -0.015 | 0.118 | 0.719 | 0.527 | 1.000 | 0.526 | 0.027 | ... | 0.041 | 0.039 | 0.075 | 0.037 | 0.015 | 0.033 | 0.015 | 0.229 | 0.033 | 0.676 |
| Gate_location | -0.002 | -0.003 | 0.006 | 0.006 | -0.012 | 0.408 | 0.546 | 0.526 | 1.000 | -0.004 | ... | 0.001 | -0.024 | -0.007 | -0.001 | -0.030 | -0.002 | -0.002 | 0.350 | -0.000 | 0.351 |
| Inflightwifi_service | 0.007 | 0.012 | -0.033 | -0.035 | 0.226 | 0.130 | -0.005 | 0.027 | -0.004 | 1.000 | ... | 0.603 | 0.058 | 0.035 | 0.041 | 0.091 | 0.039 | 0.631 | 0.071 | 0.668 | 0.356 |
| Inflight_entertainment | 0.126 | -0.029 | -0.029 | -0.032 | 0.522 | 0.423 | 0.077 | 0.365 | 0.000 | 0.251 | ... | 0.319 | 0.181 | 0.160 | 0.115 | 0.226 | 0.109 | 0.354 | 0.197 | 0.416 | 0.576 |
| Online_support | 0.116 | -0.031 | -0.033 | -0.035 | 0.387 | 0.122 | 0.002 | 0.032 | 0.004 | 0.557 | ... | 0.617 | 0.156 | 0.141 | 0.103 | 0.204 | 0.096 | 0.671 | 0.177 | 0.852 | 0.333 |
| Ease_of_Onlinebooking | 0.070 | -0.024 | -0.036 | -0.039 | 0.430 | 0.210 | 0.000 | 0.041 | 0.001 | 0.603 | ... | 1.000 | 0.432 | 0.355 | 0.399 | 0.137 | 0.418 | 0.684 | 0.356 | 0.859 | 0.455 |
| Onboard_service | 0.070 | -0.034 | -0.039 | -0.041 | 0.351 | 0.120 | 0.060 | 0.039 | -0.024 | 0.058 | ... | 0.432 | 1.000 | 0.409 | 0.527 | 0.249 | 0.549 | 0.135 | 0.685 | 0.269 | 0.326 |
| Leg_room_service | 0.092 | -0.033 | 0.002 | -0.000 | 0.305 | 0.137 | 0.027 | 0.075 | -0.007 | 0.035 | ... | 0.355 | 0.409 | 1.000 | 0.408 | 0.169 | 0.411 | 0.114 | 0.364 | 0.227 | 0.427 |
| Baggage_handling | -0.009 | 0.017 | -0.011 | -0.015 | 0.259 | 0.118 | 0.068 | 0.037 | -0.001 | 0.041 | ... | 0.399 | 0.527 | 0.408 | 1.000 | 0.241 | 0.632 | 0.114 | 0.665 | 0.230 | 0.322 |
| Checkin_service | 0.032 | 0.004 | -0.021 | -0.023 | 0.267 | 0.045 | 0.063 | 0.015 | -0.030 | 0.091 | ... | 0.137 | 0.249 | 0.169 | 0.241 | 1.000 | 0.242 | 0.183 | 0.571 | 0.196 | 0.198 |
| Cleanliness | -0.011 | 0.007 | -0.062 | -0.067 | 0.259 | 0.109 | 0.064 | 0.033 | -0.002 | 0.039 | ... | 0.418 | 0.549 | 0.411 | 0.632 | 0.242 | 1.000 | 0.106 | 0.535 | 0.231 | 0.393 |
| Online_boarding | 0.031 | 0.010 | -0.019 | -0.021 | 0.335 | 0.131 | 0.000 | 0.015 | -0.002 | 0.631 | ... | 0.684 | 0.135 | 0.114 | 0.114 | 0.183 | 0.106 | 1.000 | 0.162 | 0.883 | 0.323 |
| Facilities_avg | 0.030 | -0.004 | -0.023 | -0.027 | 0.327 | 0.245 | 0.277 | 0.229 | 0.350 | 0.071 | ... | 0.356 | 0.685 | 0.364 | 0.665 | 0.571 | 0.535 | 0.162 | 1.000 | 0.259 | 0.446 |
| Online_service_avg | 0.080 | -0.017 | -0.032 | -0.035 | 0.430 | 0.173 | 0.001 | 0.033 | -0.000 | 0.668 | ... | 0.859 | 0.269 | 0.227 | 0.230 | 0.196 | 0.231 | 0.883 | 0.259 | 1.000 | 0.414 |
| InFlight_avg | 0.073 | -0.022 | -0.038 | -0.042 | 0.392 | 0.712 | 0.535 | 0.676 | 0.351 | 0.356 | ... | 0.455 | 0.326 | 0.427 | 0.322 | 0.198 | 0.393 | 0.323 | 0.446 | 0.414 | 1.000 |
22 rows × 22 columns
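A 22 × 22 matrix is hard to scan by eye, so a short helper that lists only the strongly correlated pairs can be useful. The sketch below is one possible way to do that; the function name `top_correlated_pairs` and the 0.8 threshold are assumptions for illustration, and the toy matrix only reuses three coefficients from the table above (0.966, -0.073 and -0.079).

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation is at least `threshold`.

    Hypothetical helper, not part of the original notebook.
    """
    # Keep the strict upper triangle so each pair appears once and the
    # diagonal of 1.0 values is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of sign
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy matrix built from coefficients shown in the table above
cols = ["DepartureDelayin_Mins", "ArrivalDelayin_Mins", "Satisfaction"]
toy_corr = pd.DataFrame(
    [[1.000, 0.966, -0.073], [0.966, 1.000, -0.079], [-0.073, -0.079, 1.000]],
    index=cols,
    columns=cols,
)
print(top_correlated_pairs(toy_corr))
```

On the full matrix the same call would be `top_correlated_pairs(corr_matrix)`. Pairs that come out near 1, such as the two delay columns, are candidates for dropping one of the two features before modelling.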
sns.set(style="white", font_scale=2.2)
fig = plt.figure(figsize=[35, 30])
# hide the upper triangle (and the diagonal), which only mirrors the lower one
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(
    corr_matrix,
    cmap="seismic",
    linewidths=3,
    linecolor="white",
    vmax=1,
    vmin=-1,
    mask=mask,
    annot=True,
    fmt="0.2f",
)
plt.title("Correlation Heatmap", weight="bold", fontsize=50)
plt.savefig("heatmap.png", transparent=True, bbox_inches="tight")
sns.pairplot(data=df)
# function to plot stacked bar chart
def stacked_barplot(df, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    df: dataframe
    predictor: independent variable
    target: target variable
    """
    count = df[predictor].nunique()
    # least frequent target class; both tables are sorted by this column
    sorter = df[target].value_counts().index[-1]
    tab1 = pd.crosstab(df[predictor], df[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    # row-normalised shares, so every bar sums to 1
    tab = pd.crosstab(df[predictor], df[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # place the legend outside the axes so it does not cover the bars
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
stacked_barplot(df, "Gender", "Satisfaction")
| Gender | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| Male | 25025 | 19706 | 44731 |
| Female | 16131 | 30055 | 46186 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "CustomerType", "Satisfaction")
| CustomerType | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 36931 | 44887 | 81818 |
| Loyal Customer | 25580 | 41317 | 66897 |
| disloyal Customer | 11351 | 3570 | 14921 |
------------------------------------------------------------------------------------------------------------------------
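The raw counts are easier to interpret as shares within each customer type, which is what the `normalize="index"` crosstab inside `stacked_barplot` plots. As a minimal sketch, the snippet below rebuilds the CustomerType counts printed above by hand and divides each row by its total; the variable names `counts` and `shares` are illustrative only.

```python
import pandas as pd

# Counts copied from the CustomerType crosstab printed above
# (rows with a missing CustomerType are already excluded there).
counts = pd.DataFrame(
    {0: [25580, 11351], 1: [41317, 3570]},
    index=pd.Index(["Loyal Customer", "disloyal Customer"], name="CustomerType"),
)
counts.columns.name = "Satisfaction"

# Divide each row by its total to get the share of each outcome per group
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

Read this way, well over half of the loyal customers are satisfied, against fewer than a quarter of the disloyal ones.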
df_loyal = df.query('CustomerType == "Loyal Customer" & Satisfaction==0')
df_loyal.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 25580 | 2 | Male | 17352 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 25580 | 1 | Loyal Customer | 25580 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 23601.000 | NaN | NaN | NaN | 42.773 | 15.089 | 15.000 | 31.000 | 43.000 | 55.000 | 85.000 |
| TypeTravel | 22772 | 2 | Personal Travel | 11978 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 25580 | 3 | Eco | 14867 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 25580.000 | NaN | NaN | NaN | 2032.279 | 986.980 | 50.000 | 1463.750 | 1964.000 | 2539.000 | 6924.000 |
| DepartureDelayin_Mins | 25580.000 | NaN | NaN | NaN | 18.804 | 47.194 | 0.000 | 0.000 | 0.000 | 17.000 | 1592.000 |
| ArrivalDelayin_Mins | 25495.000 | NaN | NaN | NaN | 19.470 | 47.407 | 0.000 | 0.000 | 0.000 | 18.000 | 1584.000 |
| Satisfaction | 25580.000 | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Seat_comfort | 25580.000 | NaN | NaN | NaN | 2.498 | 1.009 | 0.000 | 2.000 | 3.000 | 3.000 | 5.000 |
| Departure.Arrival.time_convenient | 23311.000 | NaN | NaN | NaN | 3.338 | 1.450 | 0.000 | 2.000 | 4.000 | 5.000 | 5.000 |
| Food_drink | 23260.000 | NaN | NaN | NaN | 2.777 | 1.313 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 25580.000 | NaN | NaN | NaN | 2.993 | 1.274 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 25580.000 | NaN | NaN | NaN | 2.878 | 1.328 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 25580.000 | NaN | NaN | NaN | 2.688 | 1.108 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_support | 25580.000 | NaN | NaN | NaN | 2.942 | 1.224 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 25580.000 | NaN | NaN | NaN | 2.774 | 1.257 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Onboard_service | 23584.000 | NaN | NaN | NaN | 2.931 | 1.255 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Leg_room_service | 25580.000 | NaN | NaN | NaN | 2.997 | 1.280 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Baggage_handling | 25580.000 | NaN | NaN | NaN | 3.283 | 1.153 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Checkin_service | 25580.000 | NaN | NaN | NaN | 2.931 | 1.270 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 25580.000 | NaN | NaN | NaN | 3.301 | 1.152 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_boarding | 25580.000 | NaN | NaN | NaN | 2.790 | 1.280 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 25580.000 | NaN | NaN | NaN | 3.035 | 0.830 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 25580.000 | NaN | NaN | NaN | 2.836 | 1.123 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| InFlight_avg | 25580.000 | NaN | NaN | NaN | 2.924 | 0.703 | 1.000 | 2.000 | 3.000 | 3.000 | 5.000 |
df_disloyal = df.query('CustomerType == "disloyal Customer" & Satisfaction==0')
df_disloyal.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 11351 | 2 | Female | 6248 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 11351 | 1 | disloyal Customer | 11351 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 10969.000 | NaN | NaN | NaN | 31.788 | 10.946 | 15.000 | 24.000 | 28.000 | 38.000 | 85.000 |
| TypeTravel | 10065 | 2 | Business travel | 9972 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 11351 | 3 | Eco | 7314 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 11351.000 | NaN | NaN | NaN | 2021.777 | 601.981 | 220.000 | 1613.500 | 1952.000 | 2330.000 | 6837.000 |
| DepartureDelayin_Mins | 11351.000 | NaN | NaN | NaN | 16.008 | 38.108 | 0.000 | 0.000 | 0.000 | 15.000 | 569.000 |
| ArrivalDelayin_Mins | 11316.000 | NaN | NaN | NaN | 16.639 | 38.688 | 0.000 | 0.000 | 0.000 | 16.000 | 600.000 |
| Satisfaction | 11351.000 | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Seat_comfort | 11351.000 | NaN | NaN | NaN | 2.403 | 0.950 | 1.000 | 2.000 | 2.000 | 3.000 | 4.000 |
| Departure.Arrival.time_convenient | 10330.000 | NaN | NaN | NaN | 2.297 | 1.371 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Food_drink | 10333.000 | NaN | NaN | NaN | 2.409 | 1.015 | 0.000 | 2.000 | 2.000 | 3.000 | 5.000 |
| Gate_location | 11351.000 | NaN | NaN | NaN | 3.043 | 1.069 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 11351.000 | NaN | NaN | NaN | 3.029 | 1.389 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 11351.000 | NaN | NaN | NaN | 2.443 | 1.040 | 0.000 | 2.000 | 2.000 | 3.000 | 5.000 |
| Online_support | 11351.000 | NaN | NaN | NaN | 3.011 | 1.413 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 11351.000 | NaN | NaN | NaN | 3.052 | 1.391 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Onboard_service | 10436.000 | NaN | NaN | NaN | 3.089 | 1.284 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Leg_room_service | 11351.000 | NaN | NaN | NaN | 3.179 | 1.338 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Baggage_handling | 11351.000 | NaN | NaN | NaN | 3.565 | 1.073 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Checkin_service | 11351.000 | NaN | NaN | NaN | 3.065 | 1.296 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 11351.000 | NaN | NaN | NaN | 3.572 | 1.088 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Online_boarding | 11351.000 | NaN | NaN | NaN | 3.064 | 1.394 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 11351.000 | NaN | NaN | NaN | 3.199 | 0.779 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 11351.000 | NaN | NaN | NaN | 3.043 | 1.334 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| InFlight_avg | 11351.000 | NaN | NaN | NaN | 2.768 | 0.697 | 1.000 | 2.000 | 3.000 | 3.000 | 5.000 |
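The two `describe()` tables above (dissatisfied loyal and dissatisfied disloyal customers) have to be compared row by row. One way to make that comparison direct is to put the segment means side by side and sort by the gap. The sketch below assumes exactly two segments; `compare_segment_means` is a hypothetical helper and the toy rows are made up for illustration, not taken from the survey.

```python
import pandas as pd


def compare_segment_means(df, segment_col, rating_cols):
    """Mean rating per segment (one column per segment) plus their gap.

    Hypothetical helper; assumes `segment_col` has exactly two values.
    """
    means = df.groupby(segment_col)[rating_cols].mean().T
    first, second = means.columns[:2]
    means["gap"] = means[first] - means[second]
    # Largest absolute differences first
    return means.sort_values("gap", key=lambda s: s.abs(), ascending=False)


# Made-up toy rows, not the survey data
toy = pd.DataFrame(
    {
        "CustomerType": ["Loyal Customer"] * 2 + ["disloyal Customer"] * 2,
        "Seat_comfort": [3, 3, 2, 2],
        "Cleanliness": [3, 4, 4, 4],
    }
)
print(compare_segment_means(toy, "CustomerType", ["Seat_comfort", "Cleanliness"]))
```

On the notebook's data, the input would be the dissatisfied rows only (for example `df[df["Satisfaction"] == 0]`) together with the list of rating columns.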
stacked_barplot(df, "TypeTravel", "Satisfaction")
| TypeTravel | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 37062 | 44767 | 81829 |
| Business travel | 23490 | 32991 | 56481 |
| Personal Travel | 13572 | 11776 | 25348 |
------------------------------------------------------------------------------------------------------------------------
df_PT = df.query('TypeTravel == "Personal Travel" & Satisfaction==0')
df_PT.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 13572 | 2 | Male | 11441 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 12071 | 2 | Loyal Customer | 11978 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 11909.000 | NaN | NaN | NaN | 42.591 | 16.192 | 15.000 | 29.000 | 43.000 | 57.000 | 70.000 |
| TypeTravel | 13572 | 1 | Personal Travel | 13572 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 13572 | 3 | Eco | 11112 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 13572.000 | NaN | NaN | NaN | 1992.392 | 768.556 | 50.000 | 1541.000 | 1947.000 | 2380.250 | 6924.000 |
| DepartureDelayin_Mins | 13572.000 | NaN | NaN | NaN | 18.076 | 46.841 | 0.000 | 0.000 | 0.000 | 14.000 | 1128.000 |
| ArrivalDelayin_Mins | 13508.000 | NaN | NaN | NaN | 18.384 | 46.923 | 0.000 | 0.000 | 0.000 | 15.000 | 1115.000 |
| Satisfaction | 13572.000 | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Seat_comfort | 13572.000 | NaN | NaN | NaN | 2.585 | 1.019 | 1.000 | 2.000 | 3.000 | 3.000 | 4.000 |
| Departure.Arrival.time_convenient | 12361.000 | NaN | NaN | NaN | 3.675 | 1.373 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Food_drink | 12352.000 | NaN | NaN | NaN | 2.559 | 1.159 | 0.000 | 2.000 | 3.000 | 3.000 | 5.000 |
| Gate_location | 13572.000 | NaN | NaN | NaN | 2.955 | 1.153 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 13572.000 | NaN | NaN | NaN | 3.038 | 1.378 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 13572.000 | NaN | NaN | NaN | 2.628 | 1.218 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_support | 13572.000 | NaN | NaN | NaN | 3.083 | 1.408 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 13572.000 | NaN | NaN | NaN | 3.061 | 1.385 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Onboard_service | 12521.000 | NaN | NaN | NaN | 3.309 | 1.271 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Leg_room_service | 13572.000 | NaN | NaN | NaN | 3.262 | 1.270 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Baggage_handling | 13572.000 | NaN | NaN | NaN | 3.686 | 1.163 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin_service | 13572.000 | NaN | NaN | NaN | 3.342 | 1.258 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 13572.000 | NaN | NaN | NaN | 3.709 | 1.149 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Online_boarding | 13572.000 | NaN | NaN | NaN | 3.087 | 1.383 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 13572.000 | NaN | NaN | NaN | 3.337 | 0.788 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 13572.000 | NaN | NaN | NaN | 3.078 | 1.290 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| InFlight_avg | 13572.000 | NaN | NaN | NaN | 3.068 | 0.693 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
df_BT = df.query('TypeTravel == "Business travel" & Satisfaction==0')
df_BT.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 23490 | 2 | Female | 12396 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 20766 | 2 | Loyal Customer | 10794 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 22828.000 | NaN | NaN | NaN | 37.630 | 13.675 | 15.000 | 26.000 | 36.000 | 47.000 | 85.000 |
| TypeTravel | 23490 | 1 | Business travel | 23490 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 23490 | 3 | Eco | 11168 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 23490.000 | NaN | NaN | NaN | 2043.058 | 947.637 | 50.000 | 1498.000 | 1962.000 | 2509.000 | 6837.000 |
| DepartureDelayin_Mins | 23490.000 | NaN | NaN | NaN | 17.626 | 41.664 | 0.000 | 0.000 | 0.000 | 17.000 | 951.000 |
| ArrivalDelayin_Mins | 23430.000 | NaN | NaN | NaN | 18.505 | 42.247 | 0.000 | 0.000 | 1.000 | 18.000 | 952.000 |
| Satisfaction | 23490.000 | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Seat_comfort | 23490.000 | NaN | NaN | NaN | 2.404 | 0.973 | 0.000 | 2.000 | 2.000 | 3.000 | 5.000 |
| Departure.Arrival.time_convenient | 21407.000 | NaN | NaN | NaN | 2.638 | 1.443 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 21410.000 | NaN | NaN | NaN | 2.727 | 1.280 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 23490.000 | NaN | NaN | NaN | 3.035 | 1.247 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 23490.000 | NaN | NaN | NaN | 2.854 | 1.328 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 23490.000 | NaN | NaN | NaN | 2.607 | 1.014 | 0.000 | 2.000 | 3.000 | 3.000 | 5.000 |
| Online_support | 23490.000 | NaN | NaN | NaN | 2.892 | 1.207 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 23490.000 | NaN | NaN | NaN | 2.737 | 1.245 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Onboard_service | 21568.000 | NaN | NaN | NaN | 2.784 | 1.221 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Leg_room_service | 23490.000 | NaN | NaN | NaN | 2.934 | 1.304 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Baggage_handling | 23490.000 | NaN | NaN | NaN | 3.187 | 1.085 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Checkin_service | 23490.000 | NaN | NaN | NaN | 2.759 | 1.241 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 23490.000 | NaN | NaN | NaN | 3.193 | 1.088 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_boarding | 23490.000 | NaN | NaN | NaN | 2.747 | 1.271 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 23490.000 | NaN | NaN | NaN | 2.937 | 0.799 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 23490.000 | NaN | NaN | NaN | 2.793 | 1.125 | 1.000 | 2.000 | 3.000 | 3.000 | 5.000 |
| InFlight_avg | 23490.000 | NaN | NaN | NaN | 2.765 | 0.688 | 1.000 | 2.000 | 3.000 | 3.000 | 5.000 |
stacked_barplot(df, "Class", "Satisfaction")
| Class | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| Eco | 24755 | 16003 | 40758 |
| Business | 12600 | 30935 | 43535 |
| Eco Plus | 3801 | 2823 | 6624 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Seat_comfort", "Satisfaction")
| Seat_comfort | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 13274 | 7278 | 20552 |
| 2 | 12904 | 7098 | 20002 |
| 1 | 7997 | 6690 | 14687 |
| 4 | 6878 | 12911 | 19789 |
| 5 | 97 | 12422 | 12519 |
| 0 | 6 | 3362 | 3368 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Departure.Arrival.time_convenient", "Satisfaction")
| Departure.Arrival.time_convenient | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 37501 | 45172 | 82673 |
| 4.0 | 8963 | 9877 | 18840 |
| 5.0 | 7552 | 9527 | 17079 |
| 3.0 | 6873 | 7933 | 14806 |
| 2.0 | 6789 | 7750 | 14539 |
| 1.0 | 5391 | 7819 | 13210 |
| 0.0 | 1933 | 2266 | 4199 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Food_drink", "Satisfaction")
| Food_drink | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 37469 | 45267 | 82736 |
| 3.0 | 10278 | 7713 | 17991 |
| 2.0 | 9903 | 7456 | 17359 |
| 4.0 | 7091 | 10154 | 17245 |
| 1.0 | 6521 | 6879 | 13400 |
| 5.0 | 2851 | 10096 | 12947 |
| 0.0 | 825 | 2969 | 3794 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Gate_location", "Satisfaction")
| Gate_location | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 12585 | 10800 | 23385 |
| 4 | 10621 | 10467 | 21088 |
| 2 | 7222 | 9891 | 17113 |
| 1 | 6133 | 9743 | 15876 |
| 5 | 4595 | 8859 | 13454 |
| 0 | 0 | 1 | 1 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Inflightwifi_service", "Satisfaction")
| Inflightwifi_service | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 2 | 9447 | 9447 | 18894 |
| 3 | 9382 | 9817 | 19199 |
| 4 | 7994 | 14165 | 22159 |
| 1 | 7527 | 2784 | 10311 |
| 5 | 6753 | 13505 | 20258 |
| 0 | 53 | 43 | 96 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Inflight_entertainment", "Satisfaction")
| Inflight_entertainment | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 13641 | 3354 | 16995 |
| 2 | 11181 | 2346 | 13527 |
| 4 | 8186 | 21187 | 29373 |
| 1 | 6459 | 1739 | 8198 |
| 5 | 1011 | 19775 | 20786 |
| 0 | 678 | 1360 | 2038 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Online_support", "Satisfaction")
| Online_support | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 10834 | 4256 | 15090 |
| 4 | 9229 | 19813 | 29042 |
| 2 | 8501 | 3562 | 12063 |
| 1 | 6867 | 2938 | 9805 |
| 5 | 5724 | 19192 | 24916 |
| 0 | 1 | 0 | 1 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Ease_of_Onlinebooking", "Satisfaction")
| Ease_of_Onlinebooking | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 10067 | 5619 | 15686 |
| 2 | 9944 | 3952 | 13896 |
| 4 | 7867 | 20126 | 27993 |
| 1 | 7560 | 1810 | 9370 |
| 5 | 5706 | 18254 | 23960 |
| 0 | 12 | 0 | 12 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Onboard_service", "Satisfaction")
| Onboard_service | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 37910 | 45828 | 83738 |
| 3.0 | 10278 | 7133 | 17411 |
| 4.0 | 9353 | 17020 | 26373 |
| 2.0 | 7271 | 3747 | 11018 |
| 1.0 | 6236 | 2301 | 8537 |
| 5.0 | 4769 | 15627 | 20396 |
| 0.0 | 3 | 0 | 3 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Leg_room_service", "Satisfaction")
| Leg_room_service | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 9952 | 5823 | 15775 |
| 2 | 9475 | 5681 | 15156 |
| 4 | 9051 | 18763 | 27814 |
| 5 | 7018 | 17053 | 24071 |
| 1 | 5556 | 2223 | 7779 |
| 0 | 104 | 218 | 322 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Baggage_handling", "Satisfaction")
| Baggage_handling | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 4 | 13927 | 19895 | 33822 |
| 3 | 11771 | 5462 | 17233 |
| 5 | 6629 | 18373 | 25002 |
| 2 | 5627 | 3674 | 9301 |
| 1 | 3202 | 2357 | 5559 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Checkin_service", "Satisfaction")
| Checkin_service | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 10832 | 14109 | 24941 |
| 4 | 10728 | 14755 | 25483 |
| 1 | 7352 | 3409 | 10761 |
| 2 | 7238 | 3575 | 10813 |
| 5 | 5005 | 13913 | 18918 |
| 0 | 1 | 0 | 1 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Cleanliness", "Satisfaction")
| Cleanliness | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 4 | 14151 | 20095 | 34246 |
| 3 | 11543 | 5387 | 16930 |
| 5 | 6730 | 18349 | 25079 |
| 2 | 5544 | 3739 | 9283 |
| 1 | 3184 | 2191 | 5375 |
| 0 | 4 | 0 | 4 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Online_boarding", "Satisfaction")
| Online_boarding | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3 | 9636 | 11791 | 21427 |
| 2 | 9356 | 3679 | 13035 |
| 4 | 8516 | 16160 | 24676 |
| 1 | 7904 | 2873 | 10777 |
| 5 | 5735 | 15258 | 20993 |
| 0 | 9 | 0 | 9 |
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Facilities_avg", "Satisfaction")
| Facilities_avg | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3.0 | 15738 | 12307 | 28045 |
| 4.0 | 13840 | 30249 | 44089 |
| 2.0 | 10774 | 4516 | 15290 |
| 5.0 | 491 | 2608 | 3099 |
| 1.0 | 313 | 81 | 394 |
------------------------------------------------------------------------------------------------------------------------
sns.set(font_scale=1)
fig, axarr = plt.subplots(2, 2, figsize=(12, 12))
parameters = {"axes.labelsize": 15, "axes.titlesize": 10}
plt.rcParams.update(parameters)
table1 = pd.crosstab(df["Satisfaction"], df["Gate_location"])
sns.heatmap(table1, cmap="Oranges", ax=axarr[0][0])
table2 = pd.crosstab(df["Satisfaction"], df["Onboard_service"])
sns.heatmap(table2, cmap="Blues", ax=axarr[0][1])
table3 = pd.crosstab(df["Satisfaction"], df["Baggage_handling"])
sns.heatmap(table3, cmap="pink", ax=axarr[1][0])
table4 = pd.crosstab(df["Satisfaction"], df["Checkin_service"])
sns.heatmap(table4, cmap="bone", ax=axarr[1][1])
# table5 = pd.crosstab(df["Satisfaction"], df["Facilities_avg"])
# sns.heatmap(table5, cmap="red", ax=axarr[2][0])
table5 = pd.crosstab(df["Satisfaction"], df["Facilities_avg"])
sns.heatmap(table5, cmap="Oranges")
stacked_barplot(df, "Online_service_avg", "Satisfaction")
| Online_service_avg | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3.0 | 12498 | 7445 | 19943 |
| 2.0 | 10421 | 3059 | 13480 |
| 4.0 | 8057 | 25820 | 33877 |
| 1.0 | 5544 | 1155 | 6699 |
| 5.0 | 4636 | 12282 | 16918 |
------------------------------------------------------------------------------------------------------------------------
fig, axarr = plt.subplots(2, 2, figsize=(12, 12))
table6 = pd.crosstab(df["Satisfaction"], df["Online_support"])
sns.heatmap(table6, cmap="Oranges", ax=axarr[0][0])
table7 = pd.crosstab(df["Satisfaction"], df["Ease_of_Onlinebooking"])
sns.heatmap(table7, cmap="Blues", ax=axarr[0][1])
table8 = pd.crosstab(df["Satisfaction"], df["Online_boarding"])
sns.heatmap(table8, cmap="pink", ax=axarr[1][0])
table9 = pd.crosstab(df["Satisfaction"], df["Online_service_avg"])
sns.heatmap(table9, cmap="bone", ax=axarr[1][1])
stacked_barplot(df, "InFlight_avg", "Satisfaction")
| InFlight_avg | Satisfaction = 0 | Satisfaction = 1 | All |
|---|---|---|---|
| All | 41156 | 49761 | 90917 |
| 3.0 | 21628 | 17966 | 39594 |
| 2.0 | 11644 | 3890 | 15534 |
| 4.0 | 7370 | 23812 | 31182 |
| 1.0 | 467 | 586 | 1053 |
| 5.0 | 47 | 3504 | 3551 |
| 0.0 | 0 | 3 | 3 |
------------------------------------------------------------------------------------------------------------------------
fig, axarr = plt.subplots(2, 2, figsize=(12, 12))
table10 = pd.crosstab(df["Satisfaction"], df["Seat_comfort"])
sns.heatmap(table10, cmap="Oranges", ax=axarr[0][0])
table11 = pd.crosstab(df["Satisfaction"], df["Departure.Arrival.time_convenient"])
sns.heatmap(table11, cmap="Blues", ax=axarr[0][1])
table12 = pd.crosstab(df["Satisfaction"], df["Food_drink"])
sns.heatmap(table12, cmap="pink", ax=axarr[1][0])
table13 = pd.crosstab(df["Satisfaction"], df["Inflightwifi_service"])
sns.heatmap(table13, cmap="bone", ax=axarr[1][1])
fig, axarr = plt.subplots(2, 2, figsize=(12, 12))
table14 = pd.crosstab(df["Satisfaction"], df["Inflight_entertainment"])
sns.heatmap(table14, cmap="Oranges", ax=axarr[0][0])
table15 = pd.crosstab(df["Satisfaction"], df["Leg_room_service"])
sns.heatmap(table15, cmap="Blues", ax=axarr[0][1])
table16 = pd.crosstab(df["Satisfaction"], df["Cleanliness"])
sns.heatmap(table16, cmap="pink", ax=axarr[1][0])
table17 = pd.crosstab(df["Satisfaction"], df["InFlight_avg"])
sns.heatmap(table17, cmap="bone", ax=axarr[1][1])
### Function to plot distributions
def distribution_plot_wrt_target(df, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = df[target].unique()

    # top row: histogram of the predictor for each target class
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=df[df[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=df[df[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    # bottom row: boxplots by target class, with and without outliers
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=df, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=df,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
distribution_plot_wrt_target(df, "Age", "Satisfaction")
sns.set(font_scale=1)
with sns.axes_style("white"):
    # pass the column as x= explicitly; recent seaborn versions reject positional data arguments
    g = sns.catplot(
        x="Age",
        data=df,
        aspect=3.0,
        kind="count",
        hue="Satisfaction",
        order=range(15, 80),
    )
    g.set_ylabels("Age vs Passenger Satisfaction")
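Age_young = df[df.Age < 39]
# both bounds must hold for the middle band, so combine them with &; using |
# selects every row with a non-null Age, which is what the Age_middle.describe()
# output below reflects (ages 15 to 85 over 86,429 rows)
Age_middle = df[(df.Age > 38) & (df.Age < 61)]
Age_old = df[df.Age > 60]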
Age_young.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 37730 | 2 | Female | 19370 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 33996 | 2 | Loyal Customer | 22565 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 37730.000 | NaN | NaN | NaN | 27.748 | 6.404 | 15.000 | 23.000 | 27.000 | 33.000 | 38.000 |
| TypeTravel | 33969 | 2 | Business travel | 24433 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 37730 | 3 | Eco | 18877 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 37730.000 | NaN | NaN | NaN | 2258.925 | 931.695 | 50.000 | 1659.000 | 2072.000 | 2664.000 | 6907.000 |
| DepartureDelayin_Mins | 37730.000 | NaN | NaN | NaN | 14.825 | 38.751 | 0.000 | 0.000 | 0.000 | 12.000 | 1305.000 |
| ArrivalDelayin_Mins | 37619.000 | NaN | NaN | NaN | 15.302 | 39.349 | 0.000 | 0.000 | 0.000 | 13.000 | 1280.000 |
| Satisfaction | 37730.000 | NaN | NaN | NaN | 0.462 | 0.499 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Seat_comfort | 37730.000 | NaN | NaN | NaN | 2.806 | 1.382 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure.Arrival.time_convenient | 34279.000 | NaN | NaN | NaN | 2.877 | 1.572 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 34335.000 | NaN | NaN | NaN | 2.813 | 1.431 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 37730.000 | NaN | NaN | NaN | 2.992 | 1.271 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 37730.000 | NaN | NaN | NaN | 3.231 | 1.347 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 37730.000 | NaN | NaN | NaN | 3.148 | 1.397 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_support | 37730.000 | NaN | NaN | NaN | 3.317 | 1.359 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 37730.000 | NaN | NaN | NaN | 3.342 | 1.341 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Onboard_service | 34759.000 | NaN | NaN | NaN | 3.331 | 1.286 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Leg_room_service | 37730.000 | NaN | NaN | NaN | 3.317 | 1.319 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Baggage_handling | 37730.000 | NaN | NaN | NaN | 3.676 | 1.132 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin_service | 37730.000 | NaN | NaN | NaN | 3.265 | 1.283 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 37730.000 | NaN | NaN | NaN | 3.689 | 1.134 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Online_boarding | 37730.000 | NaN | NaN | NaN | 3.279 | 1.339 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 37730.000 | NaN | NaN | NaN | 3.323 | 0.802 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 37730.000 | NaN | NaN | NaN | 3.314 | 1.258 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| InFlight_avg | 37730.000 | NaN | NaN | NaN | 3.136 | 0.825 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
Age_middle.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 86429 | 2 | Female | 43910 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 77775 | 2 | Loyal Customer | 63323 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 86429.000 | NaN | NaN | NaN | 40.945 | 13.967 | 15.000 | 29.000 | 41.000 | 52.000 | 85.000 |
| TypeTravel | 77819 | 2 | Business travel | 55585 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 86429 | 3 | Business | 42884 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 86429.000 | NaN | NaN | NaN | 1968.643 | 1036.967 | 50.000 | 1325.000 | 1915.000 | 2541.000 | 6950.000 |
| DepartureDelayin_Mins | 86429.000 | NaN | NaN | NaN | 14.653 | 38.502 | 0.000 | 0.000 | 0.000 | 12.000 | 1592.000 |
| ArrivalDelayin_Mins | 86164.000 | NaN | NaN | NaN | 15.020 | 38.881 | 0.000 | 0.000 | 0.000 | 13.000 | 1584.000 |
| Satisfaction | 86429.000 | NaN | NaN | NaN | 0.554 | 0.497 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| Seat_comfort | 86429.000 | NaN | NaN | NaN | 2.840 | 1.395 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure.Arrival.time_convenient | 78596.000 | NaN | NaN | NaN | 2.981 | 1.525 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 78680.000 | NaN | NaN | NaN | 2.854 | 1.443 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 86429.000 | NaN | NaN | NaN | 2.991 | 1.312 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 86429.000 | NaN | NaN | NaN | 3.257 | 1.319 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 86429.000 | NaN | NaN | NaN | 3.400 | 1.336 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Online_support | 86429.000 | NaN | NaN | NaN | 3.533 | 1.302 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Ease_of_Onlinebooking | 86429.000 | NaN | NaN | NaN | 3.485 | 1.301 | 0.000 | 2.000 | 4.000 | 5.000 | 5.000 |
| Onboard_service | 79557.000 | NaN | NaN | NaN | 3.471 | 1.270 | 0.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Leg_room_service | 86429.000 | NaN | NaN | NaN | 3.493 | 1.290 | 0.000 | 2.000 | 4.000 | 5.000 | 5.000 |
| Baggage_handling | 86429.000 | NaN | NaN | NaN | 3.695 | 1.158 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin_service | 86429.000 | NaN | NaN | NaN | 3.342 | 1.260 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 86429.000 | NaN | NaN | NaN | 3.704 | 1.152 | 0.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Online_boarding | 86429.000 | NaN | NaN | NaN | 3.360 | 1.297 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Facilities_avg | 86429.000 | NaN | NaN | NaN | 3.377 | 0.817 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Online_service_avg | 86429.000 | NaN | NaN | NaN | 3.459 | 1.159 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| InFlight_avg | 86429.000 | NaN | NaN | NaN | 3.230 | 0.816 | 0.000 | 3.000 | 3.000 | 4.000 | 5.000 |
Age_old.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 7128 | 2 | Female | 3581 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CustomerType | 6406 | 2 | Loyal Customer | 6084 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 7128.000 | NaN | NaN | NaN | 66.105 | 4.148 | 61.000 | 63.000 | 66.000 | 69.000 | 85.000 |
| TypeTravel | 6420 | 2 | Personal Travel | 3982 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 7128 | 3 | Eco | 4468 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Flight_Distance | 7128.000 | NaN | NaN | NaN | 1525.207 | 1003.817 | 50.000 | 608.000 | 1511.000 | 2171.000 | 6889.000 |
| DepartureDelayin_Mins | 7128.000 | NaN | NaN | NaN | 14.177 | 36.030 | 0.000 | 0.000 | 0.000 | 12.000 | 565.000 |
| ArrivalDelayin_Mins | 7099.000 | NaN | NaN | NaN | 14.502 | 36.582 | 0.000 | 0.000 | 0.000 | 12.000 | 624.000 |
| Satisfaction | 7128.000 | NaN | NaN | NaN | 0.433 | 0.496 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Seat_comfort | 7128.000 | NaN | NaN | NaN | 2.786 | 1.358 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Departure.Arrival.time_convenient | 6486.000 | NaN | NaN | NaN | 3.197 | 1.487 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Food_drink | 6464.000 | NaN | NaN | NaN | 2.815 | 1.448 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Gate_location | 7128.000 | NaN | NaN | NaN | 2.964 | 1.289 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflightwifi_service | 7128.000 | NaN | NaN | NaN | 3.141 | 1.341 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Inflight_entertainment | 7128.000 | NaN | NaN | NaN | 3.227 | 1.313 | 0.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Online_support | 7128.000 | NaN | NaN | NaN | 3.413 | 1.292 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Ease_of_Onlinebooking | 7128.000 | NaN | NaN | NaN | 3.341 | 1.326 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Onboard_service | 6559.000 | NaN | NaN | NaN | 3.357 | 1.296 | 1.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Leg_room_service | 7128.000 | NaN | NaN | NaN | 3.410 | 1.306 | 0.000 | 2.000 | 4.000 | 4.000 | 5.000 |
| Baggage_handling | 7128.000 | NaN | NaN | NaN | 3.552 | 1.225 | 1.000 | 3.000 | 4.000 | 5.000 | 5.000 |
| Checkin_service | 7128.000 | NaN | NaN | NaN | 3.230 | 1.276 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Cleanliness | 7128.000 | NaN | NaN | NaN | 3.562 | 1.197 | 1.000 | 3.000 | 4.000 | 4.000 | 5.000 |
| Online_boarding | 7128.000 | NaN | NaN | NaN | 3.196 | 1.320 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| Facilities_avg | 7128.000 | NaN | NaN | NaN | 3.276 | 0.839 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Online_service_avg | 7128.000 | NaN | NaN | NaN | 3.315 | 1.127 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| InFlight_avg | 7128.000 | NaN | NaN | NaN | 3.169 | 0.797 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
distribution_plot_wrt_target(df, "Flight_Distance", "Satisfaction")
distribution_plot_wrt_target(df, "DepartureDelayin_Mins", "Satisfaction")
distribution_plot_wrt_target(df, "ArrivalDelayin_Mins", "Satisfaction")
cols = df[
    ["Age", "Flight_Distance", "DepartureDelayin_Mins", "ArrivalDelayin_Mins"]
].columns.tolist()
plt.figure(figsize=(12, 10))
# one panel per variable, plotted against Age and split by Satisfaction
for i, variable in enumerate(cols):
    plt.subplot(3, 3, i + 1)
    sns.lineplot(x=df["Age"], y=df[variable], hue=df["Satisfaction"], ci=0)
    plt.tight_layout()
    plt.title(variable)
plt.show()
with sns.axes_style("white"):
    g = sns.catplot(
        x="Age",
        y="TypeTravel",
        hue="Satisfaction",
        col="Gender",
        data=df,
        kind="bar",
        height=5,
        aspect=1,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Age",
        y="CustomerType",
        hue="Satisfaction",
        col="Gender",
        data=df,
        kind="bar",
        height=5,
        aspect=1,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Flight_Distance",
        y="CustomerType",
        hue="Satisfaction",
        col="Gender",
        data=df,
        kind="bar",
        height=5,
        aspect=1,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Age",
        y="TypeTravel",
        hue="Satisfaction",
        col="Class",
        data=df,
        kind="bar",
        height=5,
        aspect=1,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Age",
        y="CustomerType",
        hue="Satisfaction",
        col="Class",
        data=df,
        kind="bar",
        height=5,
        aspect=1,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Flight_Distance",
        y="TypeTravel",
        hue="Satisfaction",
        col="Class",
        data=df,
        kind="bar",
        height=4.5,
        aspect=0.8,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Flight_Distance",
        y="CustomerType",
        hue="Satisfaction",
        col="Class",
        data=df,
        kind="bar",
        height=4.5,
        aspect=0.8,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Class",
        y="DepartureDelayin_Mins",
        hue="Satisfaction",
        col="TypeTravel",
        data=df,
        kind="bar",
        height=4.5,
        aspect=0.8,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Class",
        y="ArrivalDelayin_Mins",
        hue="Satisfaction",
        col="TypeTravel",
        data=df,
        kind="bar",
        height=4.5,
        aspect=0.8,
    )
cols = df[
    [
        "Gate_location",
        "Onboard_service",
        "Baggage_handling",
        "Checkin_service",
        "Facilities_avg",
    ]
].columns.tolist()
plt.figure(figsize=(12, 10))
# facility-related ratings against Age, split by Satisfaction
for i, variable in enumerate(cols):
    plt.subplot(3, 3, i + 1)
    sns.lineplot(x=df["Age"], y=df[variable], hue=df["Satisfaction"], ci=0)
    plt.tight_layout()
    plt.title(variable)
plt.show()
cols = df[
    ["Online_support", "Ease_of_Onlinebooking", "Online_boarding", "Online_service_avg"]
].columns.tolist()
plt.figure(figsize=(12, 10))
# online-service ratings against Age, split by Satisfaction
for i, variable in enumerate(cols):
    plt.subplot(3, 3, i + 1)
    sns.lineplot(x=df["Age"], y=df[variable], hue=df["Satisfaction"], ci=0)
    plt.tight_layout()
    plt.title(variable)
plt.show()
cols = df[
    [
        "Seat_comfort",
        "Departure.Arrival.time_convenient",
        "Food_drink",
        "Inflightwifi_service",
        "Inflight_entertainment",
        "Leg_room_service",
        "Cleanliness",
        "InFlight_avg",
    ]
].columns.tolist()
plt.figure(figsize=(12, 10))
# in-flight ratings against Age, split by Satisfaction
for i, variable in enumerate(cols):
    plt.subplot(3, 3, i + 1)
    sns.lineplot(x=df["Age"], y=df[variable], hue=df["Satisfaction"], ci=0)
    plt.tight_layout()
    plt.title(variable)
plt.show()
with sns.axes_style("white"):
    g = sns.catplot(
        x="Inflight_entertainment",
        y="Flight_Distance",
        hue="Satisfaction",
        col="TypeTravel",
        data=df,
        kind="bar",
        height=4.5,
        aspect=0.8,
    )
with sns.axes_style("white"):
    g = sns.catplot(
        x="Inflight_entertainment",
        y="Age",
        hue="Satisfaction",
        col="Gender",
        data=df,
        kind="bar",
        height=4.5,
        aspect=0.8,
    )
Q1 = df.quantile(0.25, numeric_only=True)  # 25th percentile of each numeric column
Q3 = df.quantile(0.75, numeric_only=True)  # 75th percentile of each numeric column
IQR = Q3 - Q1  # interquartile range (75th percentile - 25th percentile)
# Lower and upper bounds; values outside these bounds are treated as outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Percentage of outliers in each numeric column
(
    (df.select_dtypes(include=["float64", "int64"]) < lower)
    | (df.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(df) * 100
Age                                  0.000
Flight_Distance                      1.980
DepartureDelayin_Mins               13.881
ArrivalDelayin_Mins                 13.403
Satisfaction                         0.000
Seat_comfort                         0.000
Departure.Arrival.time_convenient    0.000
Food_drink                           0.000
Gate_location                        0.000
Inflightwifi_service                 0.000
Inflight_entertainment               0.000
Online_support                       0.000
Ease_of_Onlinebooking                0.000
Onboard_service                      9.393
Leg_room_service                     0.000
Baggage_handling                     0.000
Checkin_service                     11.837
Cleanliness                          0.000
Online_boarding                      0.000
Facilities_avg                       0.433
Online_service_avg                   7.368
InFlight_avg                         1.161
dtype: float64
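As a small illustration of the 1.5 × IQR rule used above, the sketch below applies the same bounds to a made-up Series. The values and the helper name `iqr_outlier_share` are hypothetical and are not part of the airline data.

```python
import pandas as pd


def iqr_outlier_share(s: pd.Series) -> float:
    """Return the percentage of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((s < lower) | (s > upper)).mean() * 100


# Toy delay values (made up): mostly small, with one extreme delay
delays = pd.Series([0, 0, 0, 0, 5, 10, 12, 15, 20, 600])
share = iqr_outlier_share(delays)
print(f"{share:.1f}% of the toy values are flagged as outliers")
```

Columns dominated by zeros, such as the delay columns, have a narrow IQR, so even moderate delays fall outside the bounds. That is one plausible reason for the high percentages reported for the two delay columns.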
imputer = KNNImputer(n_neighbors=5)
# defining a list with names of columns that will be used for imputation
reqd_col_for_impute = [
"Age",
"CustomerType",
"TypeTravel",
"ArrivalDelayin_Mins",
"Departure.Arrival.time_convenient",
"Food_drink",
"Onboard_service",
]
df[reqd_col_for_impute].head()
| | Age | CustomerType | TypeTravel | ArrivalDelayin_Mins | Departure.Arrival.time_convenient | Food_drink | Onboard_service |
|---|---|---|---|---|---|---|---|
| 0 | 65.000 | Loyal Customer | Personal Travel | 0.000 | 0.000 | 0.000 | 3.000 |
| 1 | 15.000 | Loyal Customer | Personal Travel | 0.000 | 0.000 | 0.000 | NaN |
| 2 | 60.000 | Loyal Customer | Personal Travel | 0.000 | NaN | 0.000 | 1.000 |
| 3 | 70.000 | Loyal Customer | Personal Travel | 0.000 | 0.000 | 0.000 | 2.000 |
| 4 | 30.000 | Loyal Customer | NaN | 0.000 | 0.000 | 0.000 | 5.000 |
# make a copy
data1 = df.copy()
# we need to pass numerical values for each categorical column for KNN imputation so we will label encode them
Gender = {"Male": 0, "Female": 1}
data1["Gender"] = data1["Gender"].map(Gender)
CustomerType = {"disloyal Customer": 0, "Loyal Customer": 1}
data1["CustomerType"] = data1["CustomerType"].map(CustomerType)
TypeTravel = {"Business travel": 0, "Personal Travel": 1}
data1["TypeTravel"] = data1["TypeTravel"].map(TypeTravel)
Class = {"Business": 0, "Eco": 1, "Eco Plus": 2}
data1["Class"] = data1["Class"].map(Class)
data1.head()
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 65.000 | 1 | 1 | 265 | 0 | 0.000 | 1 | 0 | ... | 3 | 3.000 | 0 | 3 | 5 | 3 | 2 | 3.000 | 2.000 | 1.000 |
| 1 | 1 | 1 | 15.000 | 1 | 1 | 2138 | 0 | 0.000 | 1 | 0 | ... | 2 | NaN | 3 | 4 | 4 | 4 | 2 | 4.000 | 2.000 | 1.000 |
| 2 | 1 | 1 | 60.000 | 1 | 1 | 623 | 0 | 0.000 | 1 | 0 | ... | 1 | 1.000 | 0 | 1 | 4 | 1 | 3 | 2.000 | 2.000 | 1.000 |
| 3 | 1 | 1 | 70.000 | 1 | 1 | 354 | 0 | 0.000 | 1 | 0 | ... | 2 | 2.000 | 0 | 2 | 4 | 2 | 5 | 3.000 | 4.000 | 1.000 |
| 4 | 0 | 1 | 30.000 | NaN | 1 | 1894 | 0 | 0.000 | 1 | 0 | ... | 2 | 5.000 | 4 | 5 | 5 | 4 | 2 | 4.000 | 2.000 | 1.000 |
5 rows × 26 columns
data1[reqd_col_for_impute] = imputer.fit_transform(data1[reqd_col_for_impute])
### checking missing values
data1.isna().sum()
Gender                               0
CustomerType                         0
Age                                  0
TypeTravel                           0
Class                                0
Flight_Distance                      0
DepartureDelayin_Mins                0
ArrivalDelayin_Mins                  0
Satisfaction                         0
Seat_comfort                         0
Departure.Arrival.time_convenient    0
Food_drink                           0
Gate_location                        0
Inflightwifi_service                 0
Inflight_entertainment               0
Online_support                       0
Ease_of_Onlinebooking                0
Onboard_service                      0
Leg_room_service                     0
Baggage_handling                     0
Checkin_service                      0
Cleanliness                          0
Online_boarding                      0
Facilities_avg                       0
Online_service_avg                   0
InFlight_avg                         0
dtype: int64
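KNNImputer fills a gap with the average of the nearest rows, so a label-encoded column can end up with fractional codes (the scaled preview later in the notebook shows values such as 0.400 for TypeTravel). The sketch below reproduces this on made-up data and, assuming rounding to the nearest valid code is acceptable for the column, snaps the result back to 0 or 1. The toy frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (made up): a numeric column and a 0/1 label-encoded column with one gap
toy = pd.DataFrame(
    {
        "Age": [20.0, 22.0, 21.0, 60.0, 62.0, 23.0],
        "TypeTravel": [0.0, 0.0, 1.0, 1.0, 1.0, np.nan],
    }
)

imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# The imputed code is the mean of the 3 nearest rows, so it need not be 0 or 1
raw_code = imputed.loc[5, "TypeTravel"]

# Assumption: rounding to the nearest valid code is acceptable for this column
imputed["TypeTravel"] = imputed["TypeTravel"].round().clip(0, 1).astype(int)
print(raw_code, imputed.loc[5, "TypeTravel"])
```

Whether to round, or to impute categorical columns with a mode-based strategy instead, is a modelling choice; the notebook itself keeps the fractional values.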
Next, I bring all features to a similar scale or range. Feature scaling can have a significant effect on a machine learning model's training efficiency and can reduce the time taken to train it. Min-max scaling to [0, 1] also keeps every feature non-negative, which the chi-squared feature selection used in the next step requires.
r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(data1)
modified_data = pd.DataFrame(r_scaler.transform(data1), columns=data1.columns)
modified_data.head()
| | Gender | CustomerType | Age | TypeTravel | Class | Flight_Distance | DepartureDelayin_Mins | ArrivalDelayin_Mins | Satisfaction | Seat_comfort | ... | Ease_of_Onlinebooking | Onboard_service | Leg_room_service | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 0.714 | 1.000 | 0.500 | 0.031 | 0.000 | 0.000 | 1.000 | 0.000 | ... | 0.600 | 0.600 | 0.000 | 0.500 | 1.000 | 0.600 | 0.400 | 0.500 | 0.250 | 0.200 |
| 1 | 1.000 | 1.000 | 0.000 | 1.000 | 0.500 | 0.303 | 0.000 | 0.000 | 1.000 | 0.000 | ... | 0.400 | 0.760 | 0.600 | 0.750 | 0.800 | 0.800 | 0.400 | 0.750 | 0.250 | 0.200 |
| 2 | 1.000 | 1.000 | 0.643 | 1.000 | 0.500 | 0.083 | 0.000 | 0.000 | 1.000 | 0.000 | ... | 0.200 | 0.200 | 0.000 | 0.000 | 0.800 | 0.200 | 0.600 | 0.250 | 0.250 | 0.200 |
| 3 | 1.000 | 1.000 | 0.786 | 1.000 | 0.500 | 0.044 | 0.000 | 0.000 | 1.000 | 0.000 | ... | 0.400 | 0.400 | 0.000 | 0.250 | 0.800 | 0.400 | 1.000 | 0.500 | 0.750 | 0.200 |
| 4 | 0.000 | 1.000 | 0.214 | 0.400 | 0.500 | 0.267 | 0.000 | 0.000 | 1.000 | 0.000 | ... | 0.400 | 1.000 | 0.800 | 1.000 | 1.000 | 0.800 | 0.400 | 0.750 | 0.250 | 0.200 |
5 rows × 26 columns
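MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below checks that formula against the scaler for the Age values in the first rows above, assuming 15 and 85 are the column minimum and maximum; the small array is only an illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Ages from the first rows shown above, plus 85 as the assumed column maximum
ages = np.array([[65.0], [15.0], [60.0], [70.0], [30.0], [85.0]])

scaled = MinMaxScaler().fit_transform(ages)

# Same transformation written out by hand
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(np.round(scaled.ravel(), 3))
```

With these bounds an age of 65 maps to 50 / 70, which rounds to the 0.714 shown in the first row of the scaled table.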
from sklearn.feature_selection import SelectKBest, chi2
X = modified_data.loc[:, modified_data.columns != "Satisfaction"]
y = modified_data[["Satisfaction"]]
selector = SelectKBest(chi2, k=12)
selector.fit(X, y)
X_new = selector.transform(X)
print(X.columns[selector.get_support(indices=True)])
Index(['Gender', 'CustomerType', 'Class', 'Seat_comfort',
'Inflight_entertainment', 'Online_support', 'Ease_of_Onlinebooking',
'Onboard_service', 'Leg_room_service', 'Baggage_handling',
'Online_boarding', 'Online_service_avg'],
dtype='object')
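`get_support` only reports which columns were kept. To see how strongly each feature scored, the fitted selector's `scores_` attribute can be ranked, as in the sketch below on made-up data (the column names `informative`, `noise_a` and `noise_b` are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Toy non-negative features (made up): one tracks the target, two are noise
X_toy = pd.DataFrame(
    {
        "informative": target * 0.8 + rng.random(n) * 0.2,
        "noise_a": rng.random(n),
        "noise_b": rng.random(n),
    }
)

selector_toy = SelectKBest(chi2, k=1).fit(X_toy, target)

# scores_ holds one chi-squared statistic per column; higher means stronger association
ranking = pd.Series(selector_toy.scores_, index=X_toy.columns).sort_values(
    ascending=False
)
print(ranking)
```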
lasso_model = Lasso(alpha=0.01)
selected_columns = list(X.columns)
lasso_model.fit(X, y)
list(zip(selected_columns, lasso_model.coef_))
[('Gender', 0.11103034320005893),
('CustomerType', 0.18178954305679604),
('Age', -0.0),
('TypeTravel', -0.04729928071345613),
('Class', -0.12816598551801273),
('Flight_Distance', -0.0),
('DepartureDelayin_Mins', -0.0),
('ArrivalDelayin_Mins', -0.0),
('Seat_comfort', 0.0),
('Departure.Arrival.time_convenient', -0.0),
('Food_drink', -0.0),
('Gate_location', -0.0),
('Inflightwifi_service', 0.0),
('Inflight_entertainment', 0.5292057512121757),
('Online_support', 0.0),
('Ease_of_Onlinebooking', 0.18763375309558566),
('Onboard_service', 0.15057828409841825),
('Leg_room_service', 0.10751052878768497),
('Baggage_handling', 0.03206291440972926),
('Checkin_service', 0.09325684522015301),
('Cleanliness', 0.0),
('Online_boarding', 0.0),
('Facilities_avg', 0.0),
('Online_service_avg', 0.11879708455249778),
('InFlight_avg', 0.0)]
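The Lasso penalty drives the coefficients of weak features to exactly zero, so the features it keeps can be read off by filtering the non-zero coefficients. A minimal sketch on made-up data, where only `f0` and `f2` influence the target (all names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.random((300, 4)), columns=["f0", "f1", "f2", "f3"])

# Made-up target that depends on f0 and f2 only, plus a little noise
y_toy = 2.0 * X_toy["f0"] - 1.5 * X_toy["f2"] + rng.normal(0, 0.05, 300)

lasso_toy = Lasso(alpha=0.01).fit(X_toy, y_toy)
coefs = pd.Series(lasso_toy.coef_, index=X_toy.columns)

# Features whose coefficient survived the L1 penalty
kept = coefs[coefs != 0].index.tolist()
print(kept)
```

Applied to the coefficient list above, the same filter would retain the entries with non-zero values, such as `Inflight_entertainment` and `CustomerType`.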
data1 = data1.drop(
[
"DepartureDelayin_Mins",
"ArrivalDelayin_Mins",
"Flight_Distance",
"Departure.Arrival.time_convenient",
"Gate_location",
],
axis=1,
)
X = data1.drop(["Satisfaction"], axis=1)
y = data1["Satisfaction"]
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(54549, 20) (18184, 20) (18184, 20)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 54549
Number of rows in validation data = 18184
Number of rows in test data = 18184
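The two-step split yields roughly 60% train, 20% validation and 20% test: the first call holds out 20% for testing, and the second takes 25% of the remaining 80%, which is 20% of the total. The sketch below repeats the same two calls on 1,000 made-up rows to make the arithmetic visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for row indices
labels = np.tile([0, 1], 500)  # made-up balanced target

# Step 1: hold out 20% as the test set
temp, test, lab_temp, lab_test = train_test_split(
    rows, labels, test_size=0.2, random_state=1, stratify=labels
)
# Step 2: take 25% of the remaining 80% as the validation set
train, val, lab_train, lab_val = train_test_split(
    temp, lab_temp, test_size=0.25, random_state=1, stratify=lab_temp
)
print(len(train), len(val), len(test))
```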
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(54549, 21) (18184, 21) (18184, 21)
X_train.head()
| | CustomerType | Age | TypeTravel | Seat_comfort | Food_drink | Inflightwifi_service | Inflight_entertainment | Online_support | Ease_of_Onlinebooking | Onboard_service | ... | Baggage_handling | Checkin_service | Cleanliness | Online_boarding | Facilities_avg | Online_service_avg | InFlight_avg | Gender_0 | Class_1 | Class_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34746 | 0.000 | 34.000 | 0.000 | 2 | 3.000 | 3 | 3 | 3 | 3 | 2.200 | ... | 3 | 1 | 4 | 3 | 2.000 | 3.000 | 3.000 | 1 | 0 | 0 |
| 15254 | 1.000 | 23.000 | 0.800 | 3 | 3.000 | 4 | 1 | 2 | 5 | 5.000 | ... | 5 | 4 | 5 | 3 | 4.000 | 3.000 | 3.000 | 0 | 1 | 0 |
| 8170 | 1.000 | 51.000 | 1.000 | 2 | 2.000 | 1 | 2 | 1 | 1 | 4.000 | ... | 4 | 2 | 3 | 1 | 4.000 | 1.000 | 2.000 | 1 | 0 | 1 |
| 69801 | 0.800 | 34.000 | 0.000 | 4 | 4.000 | 1 | 3 | 3 | 4 | 4.000 | ... | 4 | 3 | 4 | 5 | 4.000 | 4.000 | 3.000 | 1 | 0 | 0 |
| 58026 | 1.000 | 53.000 | 0.000 | 3 | 3.000 | 4 | 4 | 4 | 3 | 3.000 | ... | 3 | 1 | 3 | 2 | 2.000 | 3.000 | 3.000 | 1 | 0 | 0 |
5 rows × 21 columns
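Encoding each split separately works here because all three splits contain every category, so they end up with the same 21 columns. When a split might lack a category, one common safeguard is to reindex it to the training columns. A minimal sketch on made-up frames (`train_toy` and `val_toy` are hypothetical):

```python
import pandas as pd

# Toy splits (made up): the validation split happens to lack the "Eco Plus" category
train_toy = pd.DataFrame(
    {"Class": ["Business", "Eco", "Eco Plus"], "Age": [30, 40, 50]}
)
val_toy = pd.DataFrame({"Class": ["Business", "Eco"], "Age": [25, 35]})

train_enc = pd.get_dummies(train_toy, drop_first=True)
val_enc = pd.get_dummies(val_toy, drop_first=True)

# Reindex to the training columns; categories unseen in a split become all-zero columns
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(val_enc.columns))
```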
import time


# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    t0 = time.time()
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    roc_auc = roc_auc_score(target, pred)  # ROC AUC from hard class predictions
    time_taken = time.time() - t0  # seconds spent on prediction and scoring
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "ROC/AUC": roc_auc,
            "Time_taken": time_taken,
        },
        index=[0],
    )
    return df_perf


def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # each cell shows the raw count and its share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
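The performance helper passes hard class predictions to `roc_auc_score`. With 0/1 predictions the ROC curve has a single interior point, and the resulting area equals the balanced accuracy, (TPR + TNR) / 2. A threshold-independent AUC needs scores such as `predict_proba`. The sketch below compares the two on made-up data; the logistic regression model is only a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy binary problem (made up)
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

hard_pred = clf.predict(X_toy)

# AUC from 0/1 predictions: identical to balanced accuracy
auc_from_labels = roc_auc_score(y_toy, hard_pred)
bal_acc = balanced_accuracy_score(y_toy, hard_pred)

# AUC from predicted probabilities of the positive class
auc_from_proba = roc_auc_score(y_toy, clf.predict_proba(X_toy)[:, 1])
print(auc_from_labels, bal_acc, auc_from_proba)
```

The "ROC/AUC" column in the result tables of this notebook is therefore best read as balanced accuracy at the default decision threshold.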
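# Classifiers used below (imported here in case they are not already in scope)
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = []  # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))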
results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic regression: 84.65634188822668
Bagging: 93.04662308996721
Random forest: 94.36964248837005
GBM: 92.43368287199249
Adaboost: 89.88143014972006
Xgboost: 94.6074440169235
dtree: 92.63129349345682

Validation Performance:

Logistic regression: 0.8526929260450161
Bagging: 0.930064308681672
Random forest: 0.9437299035369775
GBM: 0.9265474276527331
Adaboost: 0.8977090032154341
Xgboost: 0.9441318327974276
dtree: 0.9310691318327974
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,300,50),'scale_pos_weight':[0,1,2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05], 'gamma':[0,1,3,5],
'subsample':[0.7,0.8,0.9,1]
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.05, 'gamma': 5} with CV score=0.9916934529394865:
Wall time: 3min 34s
tuned_xgb1 = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=10,
n_estimators=200,
learning_rate=0.05,
gamma=5,
)
tuned_xgb1.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.05, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200, n_jobs=12,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=10, subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None)
# Checking model's performance on training set
xgb_train1 = model_performance_classification_sklearn(tuned_xgb1, X_train, y_train)
xgb_train1
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.898 | 0.995 | 0.845 | 0.914 | 0.888 | 0.134 |
# Checking model's performance on validation set
xgb_val1 = model_performance_classification_sklearn(tuned_xgb1, X_val, y_val)
xgb_val1
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.894 | 0.992 | 0.843 | 0.911 | 0.884 | 0.040 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_xgb1, X_val, y_val)
%%time
# defining model
Model2 = XGBClassifier(random_state=1,eval_metric='logloss')
#Parameter grid to pass in GridSearchCV
param_grid2={'n_estimators':np.arange(150,250,50),'scale_pos_weight':[9,10,11],
'learning_rate':[0.03,0.05,0.06], 'gamma':[4,5,6],
'subsample':[0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
gridsearch_cv = GridSearchCV(Model2, param_grid2, scoring=scorer, cv=5)
#Fitting parameters in GridSearchCV
gridsearch_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(gridsearch_cv.best_params_,gridsearch_cv.best_score_))
Best parameters are {'gamma': 5, 'learning_rate': 0.06, 'n_estimators': 150, 'scale_pos_weight': 11, 'subsample': 1} with CV score=0.9926982873125441:
Wall time: 24min 6s
tuned_xgb2 = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=1,
scale_pos_weight=11,
n_estimators=150,
learning_rate=0.06,
gamma=5,
)
tuned_xgb2.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.06, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=150, n_jobs=12,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=11, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Checking model's performance on training set
xgb_train2 = model_performance_classification_sklearn(tuned_xgb2, X_train, y_train)
xgb_train2
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.890 | 0.995 | 0.835 | 0.908 | 0.879 | 0.123 |
# Checking model's performance on validation set
xgb_val2 = model_performance_classification_sklearn(tuned_xgb2, X_val, y_val)
xgb_val2
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.885 | 0.991 | 0.832 | 0.904 | 0.874 | 0.037 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_xgb2, X_val, y_val)
%%time
# defining model
Model3 = BaggingClassifier(random_state=0, bootstrap=True)
cl1 = DecisionTreeClassifier(
class_weight={0: 0.45, 1: 0.55}, max_depth=8, random_state=1
)
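# Parameter grid to pass in RandomizedSearchCV
param_grid3 = {
    "base_estimator": [cl1],
    "n_estimators": [5, 7, 15, 45, 130, 145, 158, 188, 200],
    # note: floats are fractions of the features, but the integer 1 means exactly one feature per estimator
    "max_features": [0.7, 0.8, 0.9, 1],
}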
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv3 = RandomizedSearchCV(
estimator=Model3,
param_distributions=param_grid3,
n_iter=50,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv3.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv3.best_params_, randomized_cv3.best_score_
)
)
Best parameters are {'n_estimators': 15, 'max_features': 1, 'base_estimator': DecisionTreeClassifier(class_weight={0: 0.45, 1: 0.55}, max_depth=8,
random_state=1)} with CV score=0.9744775512992412:
Wall time: 2min 32s
tuned_bagging3 = BaggingClassifier(
random_state=1,
bootstrap=True,
base_estimator=DecisionTreeClassifier(class_weight={0: 0.45, 1: 0.55}, max_depth=8),
n_estimators=15,
max_features=1,
)
tuned_bagging3.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.45,
1: 0.55},
max_depth=8),
max_features=1, n_estimators=15, random_state=1)
# Checking model's performance on training set
bagging_train3 = model_performance_classification_sklearn(
tuned_bagging3, X_train, y_train
)
bagging_train3
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.614 | 0.982 | 0.588 | 0.736 | 0.575 | 0.089 |
# Checking model's performance on validation set
bagging_val3 = model_performance_classification_sklearn(tuned_bagging3, X_val, y_val)
bagging_val3
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.617 | 0.983 | 0.590 | 0.738 | 0.579 | 0.033 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_bagging3, X_val, y_val)
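The tuned bagging model scores high recall but only about 0.61 accuracy. One likely contributor is `max_features=1`: in `BaggingClassifier` an integer is read as a number of features, while a float is read as a fraction, so `1` trains every tree on a single column and `1.0` uses all of them. The sketch below shows the difference on made-up data using the fitted `estimators_features_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Toy data (made up) with 8 features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Integer 1: each tree is trained on a single randomly drawn feature
bag_int = BaggingClassifier(n_estimators=5, max_features=1, random_state=1)
bag_int.fit(X_toy, y_toy)

# Float 1.0: each tree is trained on 100% of the features
bag_float = BaggingClassifier(n_estimators=5, max_features=1.0, random_state=1)
bag_float.fit(X_toy, y_toy)

n_int = len(bag_int.estimators_features_[0])
n_float = len(bag_float.estimators_features_[0])
print(n_int, n_float)
```

If the intent was "all features", writing `1.0` in the parameter grid would express that; the results reported in this notebook were produced with the integer `1`.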
%%time
# defining model
Model4 = BaggingClassifier(random_state=0, bootstrap=True)
cl2 = DecisionTreeClassifier(
class_weight={0: 0.45, 1: 0.55}, max_depth=8, random_state=1
)
# Parameter grid to pass in GridSearchCV
param_grid4 = {
    "base_estimator": [cl2],
    "n_estimators": [5, 8, 12, 15, 18, 20],
    # 1.2 removed: a float max_features must lie in (0, 1]; the integer 1 means one feature per estimator
    "max_features": [0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
gridsearch_cv4 = GridSearchCV(
    estimator=Model4,
    param_grid=param_grid4,
    scoring=scorer,
    cv=5,
)
# Fitting parameters in GridSearchCV
gridsearch_cv4.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
gridsearch_cv4.best_params_, gridsearch_cv4.best_score_
)
)
Best parameters are {'base_estimator': DecisionTreeClassifier(class_weight={0: 0.45, 1: 0.55}, max_depth=8,
random_state=1), 'max_features': 1, 'n_estimators': 15} with CV score=0.9744775512992412:
Wall time: 35.5 s
tuned_bagging4 = BaggingClassifier(
random_state=1,
bootstrap=True,
base_estimator=DecisionTreeClassifier(class_weight={0: 0.45, 1: 0.55}, max_depth=8),
n_estimators=15,
max_features=1,
)
tuned_bagging4.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.45,
1: 0.55},
max_depth=8),
max_features=1, n_estimators=15, random_state=1)
# Checking model's performance on training set
bagging_train4 = model_performance_classification_sklearn(
tuned_bagging4, X_train, y_train
)
bagging_train4
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.614 | 0.982 | 0.588 | 0.736 | 0.575 | 0.089 |
# Checking model's performance on validation set
bagging_val4 = model_performance_classification_sklearn(tuned_bagging4, X_val, y_val)
bagging_val4
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.617 | 0.983 | 0.590 | 0.738 | 0.579 | 0.034 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_bagging4, X_val, y_val)
%%time
# Choose the type of classifier.
Model5 = RandomForestClassifier(class_weight={0:0.45,1:0.55},random_state=1)
# Grid of parameters to choose from
param_grid5 = {
"max_depth": [3,5,7,9,15],
"n_estimators": [50,60,100,120],
"max_features": [0.88,0.92,'sqrt','log2','auto'],
"min_samples_split": [4,5,7,11,20],
"max_samples": [0.7,0.9,None],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
# Calling RandomizedSearchCV
randomized_cv5 = RandomizedSearchCV(
estimator=Model5,
param_distributions=param_grid5,
n_iter=50,
n_jobs=-1,
scoring=scorer,
cv=3,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv5.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv5.best_params_, randomized_cv5.best_score_
)
)
Best parameters are {'n_estimators': 120, 'min_samples_split': 20, 'max_samples': 0.7, 'max_features': 0.92, 'max_depth': 15} with CV score=0.9483855841371919:
Wall time: 1min 18s
tuned_rf5 = RandomForestClassifier(
n_estimators=120,
min_samples_split=20,
max_samples=0.7,
max_features=0.92,
max_depth=15,
class_weight={0: 0.45, 1: 0.55},
random_state=1,
)
tuned_rf5.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.45, 1: 0.55}, max_depth=15,
max_features=0.92, max_samples=0.7, min_samples_split=20,
n_estimators=120, random_state=1)
# Checking model's performance on training set
rf_train5 = model_performance_classification_sklearn(tuned_rf5, X_train, y_train)
rf_train5
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.960 | 0.964 | 0.963 | 0.963 | 0.959 | 0.770 |
# Checking model's performance on validation set
rf_val5 = model_performance_classification_sklearn(tuned_rf5, X_val, y_val)
rf_val5
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.940 | 0.944 | 0.946 | 0.945 | 0.939 | 0.261 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_rf5, X_val, y_val)
%%time
# Choose the type of classifier
Model6 = RandomForestClassifier(class_weight={0: 0.45, 1: 0.55}, random_state=1)
# Grid of parameters to choose from (narrowed around the best random-search result)
param_grid6 = {
    "max_depth": [12, 15, 18, 20],
    "n_estimators": [100, 120, 135, 140],
    "max_features": [0.88, 0.92, 0.95],
    "min_samples_split": [18, 20, 22, 25],
    "max_samples": [0.6, 0.7, 0.8],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv6 = GridSearchCV(Model6, param_grid6, scoring=scorer, cv=3)
# Fitting parameters in GridSearchCV
grid_cv6.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        grid_cv6.best_params_, grid_cv6.best_score_
    )
)
Best parameters are {'max_depth': 20, 'max_features': 0.88, 'max_samples': 0.8, 'min_samples_split': 18, 'n_estimators': 140} with CV score=0.949825830653805:
Wall time: 4h 38min 19s
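The jump from about one minute for the randomized search to more than four and a half hours here follows largely from the number of model fits. The sketch below counts them with scikit-learn's `ParameterGrid`, using the same grid values as the grid-search cell; the variable names `n_candidates` and `n_fits` are illustrative. Note also that the grid search was run without `n_jobs`, which defaults to a single job, whereas the randomized search used `n_jobs=-1`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid passed to GridSearchCV
param_grid = {
    "max_depth": [12, 15, 18, 20],
    "n_estimators": [100, 120, 135, 140],
    "max_features": [0.88, 0.92, 0.95],
    "min_samples_split": [18, 20, 22, 25],
    "max_samples": [0.6, 0.7, 0.8],
}
cv_folds = 3

# Exhaustive search: every combination is fitted once per fold
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds
print(f"Grid search: {n_candidates} candidates x {cv_folds} folds = {n_fits} fits")

# Randomized search above: n_iter=50 sampled candidates, same number of folds
random_fits = 50 * cv_folds
print(f"Randomized search: 50 candidates x {cv_folds} folds = {random_fits} fits")
```

With 1,728 fits against 150, and deeper trees in every candidate, the grid search improved the CV recall only from about 0.9484 to 0.9498.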
tuned_rf6 = RandomForestClassifier(
    n_estimators=140,
    min_samples_split=18,
    max_samples=0.8,
    max_features=0.88,
    max_depth=20,
    class_weight={0: 0.45, 1: 0.55},
    random_state=1,
)
tuned_rf6.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.45, 1: 0.55}, max_depth=20,
max_features=0.88, max_samples=0.8, min_samples_split=18,
n_estimators=140, random_state=1)
# Checking model's performance on training set
rf_train6 = model_performance_classification_sklearn(tuned_rf6, X_train, y_train)
rf_train6
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.967 | 0.970 | 0.970 | 0.970 | 0.967 | 0.922 |
# Checking model's performance on validation set
rf_val6 = model_performance_classification_sklearn(tuned_rf6, X_val, y_val)
rf_val6
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.942 | 0.946 | 0.949 | 0.947 | 0.942 | 0.322 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_rf6, X_val, y_val)
# training performance comparison
models_train_comp_df = pd.concat(
    [
        xgb_train1.T,
        xgb_train2.T,
        bagging_train3.T,
        bagging_train4.T,
        rf_train5.T,
        rf_train6.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Xgboost Tuned with Random Search",
    "Xgboost Tuned with Grid Search",
    "Bagging Tuned with Random Search",
    "Bagging Tuned with Grid Search",
    "Random Forest Tuned with Random Search",
    "Random Forest with Grid Search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Xgboost Tuned with Random Search | Xgboost Tuned with Grid Search | Bagging Tuned with Random Search | Bagging Tuned with Grid Search | Random Forest Tuned with Random Search | Random Forest with Grid Search |
|---|---|---|---|---|---|---|
| Accuracy | 0.898 | 0.890 | 0.614 | 0.614 | 0.960 | 0.967 |
| Recall | 0.995 | 0.995 | 0.982 | 0.982 | 0.964 | 0.970 |
| Precision | 0.845 | 0.835 | 0.588 | 0.588 | 0.963 | 0.970 |
| F1 | 0.914 | 0.908 | 0.736 | 0.736 | 0.963 | 0.970 |
| ROC/AUC | 0.888 | 0.879 | 0.575 | 0.575 | 0.959 | 0.967 |
| Time_taken | 0.134 | 0.123 | 0.089 | 0.089 | 0.742 | 0.922 |
# validation performance comparison
models_val_comp_df = pd.concat(
    [
        xgb_val1.T,
        xgb_val2.T,
        bagging_val3.T,
        bagging_val4.T,
        rf_val5.T,
        rf_val6.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Xgboost Tuned with Random Search",
    "Xgboost Tuned with Grid Search",
    "Bagging Tuned with Random Search",
    "Bagging Tuned with Grid Search",
    "Random Forest Tuned with Random Search",
    "Random Forest with Grid Search",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| | Xgboost Tuned with Random Search | Xgboost Tuned with Grid Search | Bagging Tuned with Random Search | Bagging Tuned with Grid Search | Random Forest Tuned with Random Search | Random Forest with Grid Search |
|---|---|---|---|---|---|---|
| Accuracy | 0.894 | 0.885 | 0.617 | 0.617 | 0.940 | 0.942 |
| Recall | 0.992 | 0.991 | 0.983 | 0.983 | 0.944 | 0.946 |
| Precision | 0.843 | 0.832 | 0.590 | 0.590 | 0.946 | 0.949 |
| F1 | 0.911 | 0.904 | 0.738 | 0.738 | 0.945 | 0.947 |
| ROC/AUC | 0.884 | 0.874 | 0.579 | 0.579 | 0.939 | 0.942 |
| Time_taken | 0.040 | 0.037 | 0.033 | 0.034 | 0.258 | 0.322 |
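One way to read the training and validation tables together is to subtract the validation scores from the training scores, which gives a rough indication of overfitting per model. The sketch below does this with the Recall, Precision, and F1 values copied from the two comparison tables; the shortened model labels and the names `gap` and `summary` are illustrative assumptions.

```python
import pandas as pd

models = ["XGB random", "XGB grid", "Bagging random", "Bagging grid", "RF random", "RF grid"]

# Scores copied from the training and validation comparison tables
train = pd.DataFrame(
    {
        "Recall": [0.995, 0.995, 0.982, 0.982, 0.964, 0.970],
        "Precision": [0.845, 0.835, 0.588, 0.588, 0.963, 0.970],
        "F1": [0.914, 0.908, 0.736, 0.736, 0.963, 0.970],
    },
    index=models,
)
val = pd.DataFrame(
    {
        "Recall": [0.992, 0.991, 0.983, 0.983, 0.944, 0.946],
        "Precision": [0.843, 0.832, 0.590, 0.590, 0.946, 0.949],
        "F1": [0.911, 0.904, 0.738, 0.738, 0.945, 0.947],
    },
    index=models,
)

# Positive gap: the model scores higher on training than on validation
gap = (train - val).round(3)
summary = val.assign(F1_gap=gap["F1"]).sort_values("F1", ascending=False)
print(summary)
```

Read this way, the two random forests have the highest validation F1 with a train/validation gap of about 0.02, the XGBoost models reach the highest recall at the cost of lower precision, and the bagging models reach high recall only with precision near 0.59.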
# Calculating different metrics on the test set
rf_test5 = model_performance_classification_sklearn(tuned_rf5, X_test, y_test)
print("Test performance:")
rf_test5
Test performance:
| | Accuracy | Recall | Precision | F1 | ROC/AUC | Time_taken |
|---|---|---|---|---|---|---|
| 0 | 0.939 | 0.945 | 0.944 | 0.945 | 0.939 | 0.259 |
# creating confusion matrix
confusion_matrix_sklearn(tuned_rf5, X_test, y_test)
feature_names = X_train.columns
importances = tuned_rf5.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Random Forest Randomized Search CV Relative Importance")
plt.show()
# Importance of each feature in tree building: the (normalized) total reduction of the
# split criterion brought by that feature, also known as the Gini importance
print(
    pd.DataFrame(
        tuned_rf5.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
| | Imp |
|---|---|
| Inflight_entertainment | 0.406 |
| Seat_comfort | 0.210 |
| Online_service_avg | 0.068 |
| Ease_of_Onlinebooking | 0.034 |
| TypeTravel | 0.031 |
| CustomerType | 0.026 |
| Gender_0 | 0.025 |
| Inflightwifi_service | 0.023 |
| Leg_room_service | 0.020 |
| Age | 0.020 |
| Cleanliness | 0.019 |
| Food_drink | 0.019 |
| Online_boarding | 0.019 |
| Checkin_service | 0.018 |
| Baggage_handling | 0.015 |
| Onboard_service | 0.013 |
| Class_1 | 0.013 |
| Online_support | 0.010 |
| Facilities_avg | 0.007 |
| Class_2 | 0.003 |
| InFlight_avg | 0.002 |
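Gini importances are computed from the training data and can favour features with many distinct values, so it may be worth cross-checking the ranking with permutation importance on held-out data. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification` and a small forest, and the names `X_tr`, `X_va`, `model`, and `comparison` are illustrative assumptions. In the notebook, the equivalent call would take `tuned_rf5`, `X_val`, and `y_val`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the airline features (assumption)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=1)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Mean drop in validation recall when each feature is shuffled in turn
result = permutation_importance(
    model, X_va, y_va, scoring="recall", n_repeats=5, random_state=1
)

comparison = pd.DataFrame(
    {
        "gini_importance": model.feature_importances_,
        "permutation_importance": result.importances_mean,
    },
    index=X.columns,
).sort_values("permutation_importance", ascending=False)
print(comparison)
```

If the two rankings agree on the top features, the conclusion drawn from the Gini importances is on firmer ground; large disagreements would suggest looking more closely at correlated or high-cardinality features.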